Lab Assignment Five: Wide and Deep Network Architectures¶
- Group:
- Salissa Hernandez
- Juan Carlos Dominguez
- Leonardo Piedrahita
- Brice Danvide
Wide and Deep Network Architectures combine the strengths of shallow models for memorization and deep models for generalization. The wide component captures feature interactions via cross-product embeddings, while the deep component uses multiple layers to learn complex, high-dimensional patterns. This architecture is particularly suited for datasets with heterogeneous features, combining categorical and numerical data effectively. In contrast, a Multi-Layer Perceptron (MLP) is a fully connected deep neural network that processes data without explicit feature crossing, relying entirely on its deep layers to learn feature interactions.
Three Wide and Deep architectures are designed and trained with varying crossed columns in the wide component and different numbers of layers in the deep branch, including one model with at least 10 layers, to investigate generalization performance. The analysis incorporates feature engineering techniques such as creating cross-product embeddings to enhance interactions between categorical features and normalizing numerical features for the deep component. Model evaluation is conducted using metrics like AUC and ROC curves, providing a detailed assessment of classification performance and decision boundaries. Stratified 10-fold cross-validation ensures robust evaluation, while dimensionality reduction techniques like Principal Component Analysis (PCA) visualize embedding separability. Key insights are drawn from cluster analysis, silhouette scores, and metric comparisons, providing recommendations for architectural improvements and dataset-specific optimizations. This analysis emphasizes clear assumptions, reproducibility, and comprehensive evaluation, serving as a complete, reproducible, and insightful study.
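The wide-and-deep wiring described above can be sketched with the Keras functional API. The layer widths, input sizes, and branch depths below are illustrative placeholders rather than the lab's actual configuration; only the six-way softmax head mirrors the six price bands used later.

```python
import numpy as np
from keras.layers import Dense, Input, concatenate
from keras.models import Model

# Illustrative sizes: 40 one-hot crossed features, 8 dense features,
# and a six-way softmax matching the six price categories
n_wide, n_deep, n_classes = 40, 8, 6

# Wide branch: sparse crossed features connect almost directly to the output
wide_in = Input(shape=(n_wide,), name="wide_input")

# Deep branch: normalized numerical features pass through stacked hidden layers
deep_in = Input(shape=(n_deep,), name="deep_input")
x = deep_in
for units in (64, 32, 16):
    x = Dense(units, activation="relu")(x)

# Both branches are concatenated before the classification head
merged = concatenate([wide_in, x])
out = Dense(n_classes, activation="softmax")(merged)

model = Model(inputs=[wide_in, deep_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Because the wide input feeds the output layer with no hidden layers in between, it behaves like a linear (memorization) model, while the deep branch learns higher-order interactions; the softmax head sees both jointly.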
The dataset used is the following:
It contains detailed tabular data about used Mercedes-Benz cars, including categorical features such as model, fuel type, and transmission, as well as numerical features like mileage, engine size, and price. The dataset supports multi-class classification, with irrelevant features removed, categorical features one-hot encoded for the wide component, and numerical features normalized for the deep component. Cross-product embeddings are created for selected features to enhance model performance by capturing feature interactions. Stratified 10-fold cross-validation is used for splitting the data, ensuring consistent class representation in each fold and providing a realistic mirror of how the model would be used in practice. This dataset provides a diverse feature space, making it an ideal choice for analyzing the effectiveness of Wide and Deep architectures.
1. Preparation¶
1.1 Defining & Preparing Class Variables¶
# Importing packages
import numpy as np
import pandas as pd
import missingno as mn
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")
# Scikit-Learn
from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score, silhouette_score
from sklearn.decomposition import PCA
from scipy import stats
from scipy.stats import ttest_rel, wilcoxon
# Tensorflow Keras
import tensorflow as tf
from keras.models import Sequential, Model
from keras.layers import Dense, Input, concatenate
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../../Data/merc.csv')
df.head(10)
| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SLK | 2005 | 5200 | Automatic | 63000 | Petrol | 325 | 32.1 | 1.8 |
| 1 | S Class | 2017 | 34948 | Automatic | 27000 | Hybrid | 20 | 61.4 | 2.1 |
| 2 | SL CLASS | 2016 | 49948 | Automatic | 6200 | Petrol | 555 | 28.0 | 5.5 |
| 3 | G Class | 2016 | 61948 | Automatic | 16000 | Petrol | 325 | 30.4 | 4.0 |
| 4 | G Class | 2016 | 73948 | Automatic | 4000 | Petrol | 325 | 30.1 | 4.0 |
| 5 | SL CLASS | 2011 | 149948 | Automatic | 3000 | Petrol | 570 | 21.4 | 6.2 |
| 6 | GLE Class | 2018 | 30948 | Automatic | 16000 | Diesel | 145 | 47.9 | 2.1 |
| 7 | S Class | 2012 | 10948 | Automatic | 107000 | Petrol | 265 | 36.7 | 3.5 |
| 8 | G Class | 2019 | 139948 | Automatic | 12000 | Petrol | 145 | 21.4 | 4.0 |
| 9 | GLA Class | 2017 | 19750 | Automatic | 15258 | Diesel | 30 | 64.2 | 2.1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13119 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         13119 non-null  object 
 1   year          13119 non-null  int64  
 2   price         13119 non-null  int64  
 3   transmission  13119 non-null  object 
 4   mileage       13119 non-null  int64  
 5   fuelType      13119 non-null  object 
 6   tax           13119 non-null  int64  
 7   mpg           13119 non-null  float64
 8   engineSize    13119 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 922.6+ KB
df.describe()
| | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
| count | 13119.000000 | 13119.000000 | 13119.000000 | 13119.000000 | 13119.000000 | 13119.000000 |
| mean | 2017.296288 | 24698.596920 | 21949.559037 | 129.972178 | 55.155843 | 2.071530 |
| std | 2.224709 | 11842.675542 | 21176.512267 | 65.260286 | 15.220082 | 0.572426 |
| min | 1970.000000 | 650.000000 | 1.000000 | 0.000000 | 1.100000 | 0.000000 |
| 25% | 2016.000000 | 17450.000000 | 6097.500000 | 125.000000 | 45.600000 | 1.800000 |
| 50% | 2018.000000 | 22480.000000 | 15189.000000 | 145.000000 | 56.500000 | 2.000000 |
| 75% | 2019.000000 | 28980.000000 | 31779.500000 | 145.000000 | 64.200000 | 2.100000 |
| max | 2020.000000 | 159999.000000 | 259000.000000 | 580.000000 | 217.300000 | 6.200000 |
# Returns the dimensions of the dataframe as (number of rows, number of columns)
df.shape
(13119, 9)
# Returns an index object containing the col labels of the dataframe
df.columns
Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuelType', 'tax',
'mpg', 'engineSize'],
dtype='object')
# Clean column names: convert camelCase to snake_case (insert underscores, lowercase)
df.columns = df.columns.str.replace(r'(?<!^)(?=[A-Z])', '_', regex=True).str.lower()
# Check the updated column names
print(df.columns)
Index(['model', 'year', 'price', 'transmission', 'mileage', 'fuel_type', 'tax',
'mpg', 'engine_size'],
dtype='object')
Checking for Duplicate Values¶
# Checking for duplicates
duplicates_before = df.duplicated().sum()
print(f'Duplicates before dropping: {duplicates_before}')
Duplicates before dropping: 259
# Dropping duplicates
df.drop_duplicates(inplace=True)
# No more duplicates!
duplicates_after = df.duplicated().sum()
print(f'Duplicates after dropping: {duplicates_after}')
Duplicates after dropping: 0
Checking for Missing/Null Values¶
# Show missing data
mn.matrix(df)
<Axes: >
# Checking for null values
df.isnull().sum()
model           0
year            0
price           0
transmission    0
mileage         0
fuel_type       0
tax             0
mpg             0
engine_size     0
dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 12860 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         12860 non-null  object 
 1   year          12860 non-null  int64  
 2   price         12860 non-null  int64  
 3   transmission  12860 non-null  object 
 4   mileage       12860 non-null  int64  
 5   fuel_type     12860 non-null  object 
 6   tax           12860 non-null  int64  
 7   mpg           12860 non-null  float64
 8   engine_size   12860 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 1004.7+ KB
Checking for Outliers¶
# Checking For Outliers
df.describe()
| | year | price | mileage | tax | mpg | engine_size |
|---|---|---|---|---|---|---|
| count | 12860.000000 | 12860.000000 | 12860.000000 | 12860.000000 | 12860.000000 | 12860.000000 |
| mean | 2017.267963 | 24636.426361 | 22169.588336 | 129.843701 | 55.197535 | 2.075381 |
| std | 2.226127 | 11874.220447 | 21077.039295 | 65.580514 | 15.181133 | 0.573434 |
| min | 1970.000000 | 650.000000 | 1.000000 | 0.000000 | 1.100000 | 0.000000 |
| 25% | 2016.000000 | 17309.750000 | 6494.000000 | 125.000000 | 45.600000 | 1.800000 |
| 50% | 2018.000000 | 22299.000000 | 15448.500000 | 145.000000 | 56.500000 | 2.000000 |
| 75% | 2019.000000 | 28971.250000 | 32000.000000 | 145.000000 | 64.200000 | 2.100000 |
| max | 2020.000000 | 159999.000000 | 259000.000000 | 580.000000 | 217.300000 | 6.200000 |
# Defines upper and lower bounds for each column
df = df[
(df['price'] >= 1000) & (df['price'] <= 60000) & # Filter price between 1,000 and 60,000
(df['mileage'] <= 150000) & # Filter mileage below 150,000
(df['tax'] <= 300) & # Filter tax below 300
(df['mpg'] >= 10) & (df['mpg'] <= 100) & # Filter mpg between 10 and 100
(df['engine_size'] > 0) & (df['engine_size'] <= 5) # Filter engineSize between 0 and 5 liters
]
# Outliers Removed!
df.describe()
| | year | price | mileage | tax | mpg | engine_size |
|---|---|---|---|---|---|---|
| count | 12351.000000 | 12351.000000 | 12351.000000 | 12351.000000 | 12351.000000 | 12351.000000 |
| mean | 2017.353494 | 23891.015707 | 21743.589507 | 126.171160 | 55.067776 | 2.027107 |
| std | 1.953895 | 9455.640104 | 19996.533334 | 54.209434 | 11.558749 | 0.463277 |
| min | 1997.000000 | 1350.000000 | 1.000000 | 0.000000 | 24.600000 | 1.300000 |
| 25% | 2016.000000 | 17299.000000 | 6620.500000 | 125.000000 | 46.300000 | 1.600000 |
| 50% | 2018.000000 | 22156.000000 | 15329.000000 | 145.000000 | 56.500000 | 2.000000 |
| 75% | 2019.000000 | 28480.000000 | 31549.000000 | 145.000000 | 64.200000 | 2.100000 |
| max | 2020.000000 | 59999.000000 | 150000.000000 | 300.000000 | 80.700000 | 4.700000 |
Evaluation of Filtering Criteria¶
Objective: The goal of the filtering criteria is to eliminate outliers that could skew the analysis and predictive modeling of car prices based on various attributes, such as price, mileage, and engine size.
1. Price Filter:¶
- Criteria: Price is filtered between £1,000 and £60,000.
- Rationale:
- Lower Bound: Setting a minimum price of £1,000 helps exclude listings that may be erroneous (e.g., missing data or extreme discounts).
- Upper Bound: The maximum price of £60,000 excludes luxury and exotic cars that do not represent the typical market for used Mercedes vehicles. The mean price post-filtering is £23,891, indicating that the filtered dataset contains more reasonably priced vehicles.
2. Mileage Filter:¶
- Criteria: Mileage is capped at 150,000 miles.
- Rationale:
- High mileage often indicates extensive use and potential wear, which correlates negatively with price. By limiting mileage to a maximum of 150,000 miles, the dataset represents vehicles that are more commonly sold in the used car market, improving the relevance of the data for predictive modeling. The mean mileage remains within a practical range (21,743 miles).
3. Tax Filter:¶
- Criteria: Tax is limited to a maximum of £300.
- Rationale:
- This upper bound excludes extremely high road-tax values, which might apply to specialty vehicles or those with high emissions. The average tax remains reasonable at £126, supporting the effectiveness of the filter.
4. MPG Filter:¶
- Criteria: MPG is filtered between 10 and 100.
- Rationale:
- Setting a minimum of 10 MPG avoids extremely inefficient vehicles that may not be practical for buyers. The maximum of 100 MPG is a logical upper limit, as cars with exceptionally high MPG are often hybrids or very efficient models that may skew predictions. The mean MPG of 55.07 suggests that the dataset retains efficient vehicles.
5. Engine Size Filter:¶
- Criteria: Engine size is limited to between 0 and 5 liters.
- Rationale:
- This range encompasses the vast majority of passenger vehicles while excluding high-performance or commercial vehicles that fall outside the typical used car market. The mean engine size of 2.03 liters is consistent with average passenger vehicles.
Conclusion¶
The filtering criteria employed appear to be effective in removing outliers and retaining a dataset that is representative of the used car market. The adjustments made through these criteria led to a more focused dataset, evidenced by reasonable means and ranges for each variable.
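As a point of comparison for the hand-picked thresholds above, bounds can also be derived from the data itself. The sketch below (not part of the lab's pipeline) computes Tukey fences from the interquartile range on a toy price series; the `iqr_bounds` helper and the sample values are illustrative.

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy price column: the extreme 250,000 listing falls outside the fences
prices = pd.Series([1350, 17299, 22156, 28480, 59999, 250000])
lo, hi = iqr_bounds(prices)
filtered = prices[prices.between(lo, hi)]
```

Quantile-based fences adapt to each column's spread, but fixed domain-informed thresholds (as used above) are easier to justify and reproduce, which is why the lab keeps them explicit.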
# Resetting the index
df = df.reset_index(drop=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12351 entries, 0 to 12350
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         12351 non-null  object 
 1   year          12351 non-null  int64  
 2   price         12351 non-null  int64  
 3   transmission  12351 non-null  object 
 4   mileage       12351 non-null  int64  
 5   fuel_type     12351 non-null  object 
 6   tax           12351 non-null  int64  
 7   mpg           12351 non-null  float64
 8   engine_size   12351 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 868.6+ KB
Visualizations for Categorical Attributes¶
Transmission¶
# Sets a Seaborn style
sns.set(style="whitegrid")
# Defines colors
colors = ['#1E90FF', '#00CED1', '#20B2AA', '#3CB371', '#4682B4', '#5F9EA0', '#87CEEB', '#00BFFF']
transmission_counts = df.transmission.value_counts()
# Filters out categories with zero counts (if any)
transmission_counts = transmission_counts[transmission_counts > 0]
# Calculates percentages
percentages = 100 * transmission_counts / transmission_counts.sum()
# Creates labels with percentages, hiding those below 1%
labels = []
for label, pct in zip(transmission_counts.index, percentages):
if pct < 1:
labels.append("") # Sets empty for small percentages
else:
labels.append(f"{label} ({pct:.1f}%)")
# Creates the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Creates the pie chart
wedges, texts = ax.pie(transmission_counts,
labels=labels,
startangle=90,
colors=colors[:len(transmission_counts)],
wedgeprops=dict(edgecolor='black', alpha=0.9))
# Styles the text labels
for text in texts:
text.set_fontsize(14)
text.set_color('black')
# Sets the title
plt.title('Distribution of Transmission Type', fontsize=25, fontweight='bold', color='black', pad=20)
# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')
# Displays the pie chart
plt.show()
Model¶
# Sets a Seaborn style
sns.set(style="whitegrid")
# Gets counts for all models
model_counts = df.model.value_counts()
total_counts = model_counts.sum()
# Calculates percentages
percentages = (model_counts / total_counts) * 100
# Creates the figure and axis
fig, ax = plt.subplots(figsize=(12, 8))
# Determines colors: unique colors for the top three percentages, grey for the rest
colors = ['#1E90FF', '#00CED1', '#20B2AA'] # Distinct colors for the top three
grey_color = '#c4c4c4' # Grey for the rest
bar_colors = [grey_color] * len(percentages)
# Gets indices of the top three models
top_three_indices = percentages.nlargest(3).index
for i in range(len(percentages)):
if percentages.index[i] in top_three_indices:
bar_colors[i] = colors.pop(0) # Assigns a distinct color
# Creates vertical bars
bars = ax.bar(percentages.index, percentages.values, color=bar_colors, alpha=0.9, edgecolor='black')
# Adds annotations for the percentage labels on top of the bars
for bar in bars:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width() / 2, height + 1, f'{height:.1f}%',
ha='center', fontsize=10, fontweight='bold', color='black')
# Sets the title
plt.title('Distribution of Car Models (Percentage)', fontsize=25, fontweight='bold', color='black', pad=20)
# Customizes the axes
ax.set_xlabel('Car Models', fontsize=14)
ax.set_ylabel('Percentage (%)', fontsize=14)
# Rotates x-tick labels to vertical for better alignment
plt.xticks(rotation=90, ha='center', fontsize=12) # Sets rotation to 90 for vertical
# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')
ax.set_facecolor('#f6f5f5')
# Adds gridlines for better readability
ax.yaxis.grid(True, which='both', linestyle='--', linewidth=0.7, color='gray')
# Hides the spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Fuel Type¶
# Sets a Seaborn style
sns.set(style="whitegrid")
# Defines a cooler color palette
colors = ['#1E90FF', '#00CED1', '#20B2AA'] + ['#c4c4c4'] * 5 # Grey for the rest
# Gets counts for fuel types
fuel_counts = df.fuel_type.value_counts()
# Calculates percentages
fuel_percentages = (fuel_counts / fuel_counts.sum()) * 100
# Creates the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Creates vertical bars
bars = ax.bar(fuel_percentages.index, fuel_percentages.values, color=colors[:len(fuel_percentages)], alpha=0.9, edgecolor='black')
# Adds annotations for the percentage labels
for bar in bars:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width() / 2, height + 1, f'{height:.1f}%',
ha='center', va='bottom', fontsize=12, fontweight='bold', color='black')
# Sets the title
plt.title('Distribution of Fuel Types', fontsize=25, fontweight='bold', color='black', pad=20)
# Customizes the x and y axis
ax.set_ylabel('Percentage (%)', fontsize=14)
ax.set_xlabel('Fuel Type', fontsize=14)
# Customizes the figure background color
fig.patch.set_facecolor('#f6f5f5')
ax.set_facecolor('#f6f5f5')
# Adds gridlines for better readability
ax.yaxis.grid(True, which='both', linestyle='--', linewidth=0.7, color='gray')
# Hides the spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Rotates the x labels to be vertical
plt.xticks(rotation=90)
plt.show()
Visualizations for Numerical Attributes¶
# Sets up the figure
fig = plt.figure(figsize=(15, 6))
fig.patch.set_facecolor('#f5f6f6')
# Creates a grid for the subplots
gs = fig.add_gridspec(2, 3)
gs.update(wspace=0.2, hspace=0.2)
# Creates subplots
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[0, 2])
ax3 = fig.add_subplot(gs[1, 0])
ax4 = fig.add_subplot(gs[1, 1])
ax5 = fig.add_subplot(gs[1, 2])
axes = [ax0, ax1, ax2, ax3, ax4, ax5]
for ax in axes:
ax.set_facecolor('#f5f6f6')
ax.tick_params(axis='x', labelsize=12, which='major', direction='out', pad=2, length=1.5)
ax.tick_params(axis='y', colors='black')
ax.axes.get_yaxis().set_visible(False)
for loc in ['left', 'right', 'top', 'bottom']:
ax.spines[loc].set_visible(False)
# Selects numerical columns
cols = df.select_dtypes(exclude='object').columns
# Plots KDE for each numerical attribute
sns.kdeplot(x=df[cols[0]], color="green", fill=True, ax=ax0)
sns.kdeplot(x=df[cols[1]], color="red", fill=True, ax=ax1)
sns.kdeplot(x=df[cols[2]], color="blue", fill=True, ax=ax2)
sns.kdeplot(x=df[cols[3]], color="black", fill=True, ax=ax3)
sns.kdeplot(x=df[cols[4]], color="pink", fill=True, ax=ax4)
sns.kdeplot(x=df[cols[5]], color="orange", fill=True, ax=ax5)
# Adds titles and texts
fig.text(0.2, 0.98, "KDE Visualizations on Numerical Attributes:", **{'font': 'serif', 'size': 18, 'weight': 'bold'}, alpha=1)
plt.show()
Encoding the Target Attribute: price¶
# Defines bins and labels
bins = [0, 10000, 20000, 30000, 40000, 50000, df['price'].max()]
labels = ['Budget', 'Affordable', 'Mid-Range', 'High-End', 'Premium', 'Luxury']
# Uses pd.cut to bin the 'price' and assign categories with an explicit order
df['price'] = pd.cut(df['price'], bins=bins, labels=labels, include_lowest=True)
# Explicitly defines the order of the categories
ordered_labels = pd.Categorical(df['price'], categories=labels, ordered=True)
# Assigns the ordered categories back to the 'price' column
df['price'] = ordered_labels
# Now, manually encodes the categories as integers
df['price_encoded'] = df['price'].cat.codes
# Checks the unique values in the encoded 'price' column
print("Encoded 'price' values:")
print(df['price_encoded'].unique())
# Checks the mapping of the labels to the encoded values
price_mapping = dict(zip(df['price'].cat.categories, range(len(df['price'].cat.categories))))
print("\nPrice Category Encoding Mapping:", price_mapping)
Encoded 'price' values:
[3 1 2 5 0 4]
Price Category Encoding Mapping: {'Budget': 0, 'Affordable': 1, 'Mid-Range': 2, 'High-End': 3, 'Premium': 4, 'Luxury': 5}
# Gets the counts of the encoded 'price' values
price_category_counts = df['price_encoded'].value_counts(normalize=True) * 100 # Normalize to get percentages
# Gets the labels corresponding to the numeric encoding
price_labels = df['price'].cat.categories # Get the price categories
# Sorts the price_category_counts so it matches the order of price_labels
price_category_counts = price_category_counts.sort_index() # Sort by index to match the category order
# Plots a bar chart
plt.figure(figsize=(10, 6))
price_category_counts.plot(kind='bar', color='skyblue', edgecolor='black')
# Adds labels and title
plt.title('Percentage Distribution of Price Categories', fontsize=18)
plt.xlabel('Price Category', fontsize=14)
plt.ylabel('Percentage (%)', fontsize=14)
# Sets the x-ticks to the correct category labels
plt.xticks(ticks=range(len(price_labels)), labels=price_labels, rotation=45)
# Shows percentage values on each bar
for index, value in enumerate(price_category_counts):
plt.text(index, value + 0.5, f'{value:.1f}%', ha='center', va='bottom', fontsize=12, fontweight='bold')
# Displays the plot
plt.tight_layout()
plt.show()
Encoding Categorical Attributes¶
Model¶
# Removes leading and trailing spaces from the 'model' column
df['model'] = df['model'].str.strip()
# Sorts the categories alphabetically
sorted_labels = sorted(df['model'].unique())
# Creates a Categorical type with sorted categories
df['model'] = pd.Categorical(df['model'], categories=sorted_labels, ordered=True)
# Encodes the 'model' column
df['model_encoded'] = df['model'].cat.codes
# Checks the mapping of the labels to the encoded values
model_mapping = dict(zip(df['model'].cat.categories, range(len(df['model'].cat.categories))))
print("\nModel Encoding Mapping:", model_mapping)
Model Encoding Mapping: {'180': 0, '200': 1, '220': 2, 'A Class': 3, 'B Class': 4, 'C Class': 5, 'CL Class': 6, 'CLA Class': 7, 'CLC Class': 8, 'CLK': 9, 'CLS Class': 10, 'E Class': 11, 'GL Class': 12, 'GLA Class': 13, 'GLB Class': 14, 'GLC Class': 15, 'GLE Class': 16, 'GLS Class': 17, 'M Class': 18, 'S Class': 19, 'SL CLASS': 20, 'SLK': 21, 'V Class': 22, 'X-CLASS': 23}
Transmission¶
# Creates a Categorical type with the unique transmission values in the original order
df['transmission'] = pd.Categorical(df['transmission'], ordered=True)
# Encodes the 'transmission' column
df['transmission_encoded'] = df['transmission'].cat.codes
# Checks the unique encoded 'transmission' values
print("Encoded 'transmission' values:")
print(df['transmission_encoded'].unique())
# Checks the mapping of the labels to the encoded values
transmission_mapping = dict(zip(df['transmission'].cat.categories, range(len(df['transmission'].cat.categories))))
print("\nTransmission Encoding Mapping:", transmission_mapping)
Encoded 'transmission' values:
[0 1 3 2]
Transmission Encoding Mapping: {'Automatic': 0, 'Manual': 1, 'Other': 2, 'Semi-Auto': 3}
Fuel Type¶
# Creates a Categorical type with the unique fuel types in the original order
df['fuel_type'] = pd.Categorical(df['fuel_type'], ordered=True)
# Encodes the 'fuel_type' column
df['fuel_type_encoded'] = df['fuel_type'].cat.codes
# Checks the mapping of the labels to the encoded values
fuel_type_mapping = dict(zip(df['fuel_type'].cat.categories, range(len(df['fuel_type'].cat.categories))))
print("\nFuel Type Encoding Mapping:", fuel_type_mapping)
Fuel Type Encoding Mapping: {'Diesel': 0, 'Hybrid': 1, 'Other': 2, 'Petrol': 3}
Encoding Numerical Attributes¶
MPG¶
# Checks the range of 'mpg' values
print("Minimum mpg value:", df['mpg'].min())
print("Maximum mpg value:", df['mpg'].max())
# Defines new bin edges that cover the entire range of 'mpg' values
bins = [0, 25, 35, 45, 55, 65, 75, 85] # Adjust these based on the actual range
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Excellent', 'Top Tier']
# Creates a new column in the DataFrame for the binned mpg values
df['mpg_binned'] = pd.cut(df['mpg'], bins=bins, labels=labels, right=False)
# Checks the distribution after binning
print(df['mpg_binned'].value_counts(dropna=False))
# Defines ordered categories and encode them
df['mpg_binned'] = pd.Categorical(df['mpg_binned'], categories=labels, ordered=True)
df['mpg_encoded'] = df['mpg_binned'].cat.codes # -1 will appear if there are values outside the bins
# Checks the unique encoded 'mpg' values and their mapping
mpg_mapping = dict(zip(df['mpg_binned'].cat.categories, range(len(df['mpg_binned'].cat.categories))))
print("\nMPG Encoding Mapping:", mpg_mapping)
# Displays the encoded values distribution
print(df['mpg_encoded'].value_counts())
Minimum mpg value: 24.6
Maximum mpg value: 80.7
mpg_binned
Very High 3846
High 2953
Excellent 2869
Medium 1924
Low 674
Top Tier 83
Very Low 2
Name: count, dtype: int64
MPG Encoding Mapping: {'Very Low': 0, 'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4, 'Excellent': 5, 'Top Tier': 6}
mpg_encoded
4 3846
3 2953
5 2869
2 1924
1 674
6 83
0 2
Name: count, dtype: int64
Year¶
# Defines bins and labels for decades
year_bins = [1990, 2000, 2010, 2020, 2030] # Adjusted to cover the full range of years
year_labels = ['1990s', '2000s', '2010s', '2020s']
# Creates a new column in the DataFrame for the binned year values
df['year_binned'] = pd.cut(df['year'], bins=year_bins, labels=year_labels, right=False)
# Checks the distribution after binning
print(df['year_binned'].value_counts(dropna=False))
# Defines ordered categories and encode them
df['year_binned'] = pd.Categorical(df['year_binned'], categories=year_labels, ordered=True)
df['year_encoded'] = df['year_binned'].cat.codes # -1 will appear if there are values outside the bins
# Checks the unique encoded year values and their mapping
year_mapping = dict(zip(df['year_binned'].cat.categories, range(len(df['year_binned'].cat.categories))))
print("\nYear Encoding Mapping:", year_mapping)
# Displays the encoded values distribution
print(df['year_encoded'].value_counts())
year_binned
2010s 11717
2020s 586
2000s 43
1990s 5
Name: count, dtype: int64
Year Encoding Mapping: {'1990s': 0, '2000s': 1, '2010s': 2, '2020s': 3}
year_encoded
2 11717
3 586
1 43
0 5
Name: count, dtype: int64
Engine Size¶
# Scaling Engine Size
scaler = StandardScaler()
df['engine_size_scaled'] = scaler.fit_transform(df[['engine_size']])
Tax¶
# Adjusts bin edges and labels to ensure coverage of all values
tax_bins = [0, 50, 100, 150, 250, 301]
tax_labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']
# Apply binning with adjusted bins
df['tax_binned'] = pd.cut(df['tax'], bins=tax_bins, labels=tax_labels, right=False)
# Re-encodes the binned values
df['tax_binned'] = pd.Categorical(df['tax_binned'], categories=tax_labels, ordered=True)
df['tax_encoded'] = df['tax_binned'].cat.codes
# Checks the unique encoded tax values and their mapping
tax_mapping = dict(zip(df['tax_binned'].cat.categories, range(len(df['tax_binned'].cat.categories))))
print("\ntax Encoding Mapping:", tax_mapping)
# Checks the distribution of 'tax_encoded' after binning and encoding
print(df['tax_encoded'].value_counts())
tax Encoding Mapping: {'Very Low': 0, 'Low': 1, 'Medium': 2, 'High': 3, 'Very High': 4}
tax_encoded
2 8300
0 2298
3 1499
4 254
Name: count, dtype: int64
Mileage¶
# Scaling Mileage
scaler = StandardScaler()
df['mileage_scaled'] = scaler.fit_transform(df[['mileage']])
# Dropping Attributes that were Encoded
df = df.drop(columns=['price', 'model', 'transmission', 'fuel_type', 'mpg', 'mpg_binned', 'year', 'year_binned', 'engine_size','tax','tax_binned', 'mileage'])
Preprocessing Summary:¶
Original Data:

- The original dataset contains a mix of categorical and numerical columns: `model`, `year`, `price`, `transmission`, `mileage`, `fuelType`, `tax`, `mpg`, and `engineSize`.
- Categorical columns: `model`, `transmission`, `fuelType`.
- Numerical columns: `price`, `year`, `mileage`, `tax`, `mpg`, `engineSize`.

Transformations and Adjustments:

- Encoding Categorical Variables:
  - The `model` column was encoded using integer labels representing the different car models.
  - The `transmission` column was encoded as integers (Automatic = 0, Manual = 1, Other = 2, Semi-Auto = 3).
  - The `fuelType` column was encoded as integers (Diesel = 0, Hybrid = 1, Other = 2, Petrol = 3).
  - The `mpg` (miles per gallon) values were binned into ordered categories ("Very Low" through "Top Tier") and encoded as integers for compatibility with machine learning models.
- Handling Numerical Features:
  - Numerical features such as `price`, `year`, `mileage`, `tax`, and `engineSize` were either binned or scaled for better use in modeling.
  - Binning: the `year` column was grouped into decades (e.g., 2010s) and encoded as integers.
  - Scaling: standard scaling was applied to `engineSize` and `mileage` due to their wide range of values.
  - Encoding Tax: the `tax` column was grouped into ordered categories ("Very Low" through "Very High") and encoded into numerical values.

Scaled Values:

- For the `engine_size_scaled` feature, standard scaling was applied to `engineSize` so that it has a mean of 0 and a standard deviation of 1.
- Similarly, `mileage` was scaled (`mileage_scaled`) so that it is on the same scale as the other numerical features, improving compatibility with the machine learning models.

Encoded Variables:

- The `price_encoded` target was created by binning `price` into the ordered categories Budget, Affordable, Mid-Range, High-End, Premium, and Luxury.
- The categorical columns (`model_encoded`, `transmission_encoded`, `fuel_type_encoded`, `mpg_encoded`, `tax_encoded`) were all encoded into numerical values for model input.
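The cross-product features mentioned in the introduction for the wide component can be built directly from the encoded columns. Below is a minimal sketch on a toy frame; the column names follow the preprocessed dataset, but the values and the `model_x_fuel` name are illustrative.

```python
import pandas as pd

# Toy frame standing in for the encoded dataset (values illustrative)
toy = pd.DataFrame({
    "model_encoded": [19, 19, 13, 3],
    "fuel_type_encoded": [1, 3, 0, 3],
})

# Cross-product feature: each unique (model, fuel_type) pair gets its own id,
# which can then be one-hot encoded to feed the wide branch
cross = toy["model_encoded"].astype(str) + "_" + toy["fuel_type_encoded"].astype(str)
toy["model_x_fuel"] = pd.factorize(cross)[0]

# One-hot matrix of the crossed feature for the wide input
wide_matrix = pd.get_dummies(toy["model_x_fuel"], prefix="cross")
```

Crossing lets the wide branch memorize interactions (e.g., a specific model sold with a specific fuel type) that a linear model over the individual columns could not express.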
Preprocessed Dataframe¶
df_preprocessed = df.copy()
df_preprocessed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12351 entries, 0 to 12350
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   price_encoded         12351 non-null  int8   
 1   model_encoded         12351 non-null  int8   
 2   transmission_encoded  12351 non-null  int8   
 3   fuel_type_encoded     12351 non-null  int8   
 4   mpg_encoded           12351 non-null  int8   
 5   year_encoded          12351 non-null  int8   
 6   engine_size_scaled    12351 non-null  float64
 7   tax_encoded           12351 non-null  int8   
 8   mileage_scaled        12351 non-null  float64
dtypes: float64(2), int8(7)
memory usage: 277.5 KB
df_preprocessed.head(10)
| | price_encoded | model_encoded | transmission_encoded | fuel_type_encoded | mpg_encoded | year_encoded | engine_size_scaled | tax_encoded | mileage_scaled |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 19 | 0 | 1 | 4 | 2 | 0.157348 | 0 | 0.262877 |
| 1 | 3 | 16 | 0 | 0 | 3 | 2 | 0.157348 | 2 | -0.287241 |
| 2 | 1 | 19 | 0 | 3 | 2 | 2 | 3.179422 | 4 | 4.263732 |
| 3 | 1 | 13 | 0 | 0 | 4 | 2 | 0.157348 | 0 | -0.324349 |
| 4 | 3 | 3 | 0 | 3 | 2 | 3 | -0.058514 | 2 | -1.057105 |
| 5 | 2 | 3 | 0 | 0 | 4 | 3 | -1.137826 | 2 | -1.037401 |
| 6 | 2 | 4 | 0 | 0 | 4 | 2 | -0.058514 | 2 | -1.073509 |
| 7 | 1 | 13 | 0 | 0 | 4 | 2 | 0.157348 | 2 | 0.998684 |
| 8 | 1 | 4 | 0 | 0 | 5 | 2 | -1.137826 | 3 | 0.154904 |
| 9 | 1 | 5 | 0 | 0 | 4 | 2 | 0.157348 | 3 | 0.426962 |
Final Dataset for Classification:¶
Data Preprocessing:¶
The dataset initially contained both categorical and numerical variables. The relevant features were identified, including price, model, year, transmission, mileage, fuelType, tax, mpg, and engineSize. Missing values were handled as necessary, and the variables were cleaned for consistency.
Feature Engineering:¶
- Target Variable Transformation: The `price` variable was transformed into a categorical variable (`price_encoded`) through binning, grouping prices into discrete ordered categories (Budget, Affordable, Mid-Range, High-End, Premium, Luxury). This converts the problem from regression to classification: the goal is now to predict which price category a given car falls into based on the other features.
- Categorical Encoding: Several categorical variables were encoded into numerical values for easier use in the machine learning model. `transmission` and `fuelType` were label-encoded, and `model` was also label-encoded to represent the car models numerically.
- Numerical Transformation: `engineSize` and `mileage` were scaled using standard scaling to ensure all numerical features are on the same scale.
- Binning of Year: The `year` column was binned by decade (e.g., 2000s, 2010s) to group the data into more manageable categories, reducing the influence of specific years on the model.
Dimensionality Reduction/Feature Selection:¶
- Binning of Numerical Variables: Binning numerical variables such as `mileage` was initially considered, but due to their skewed distributions, scaling was applied instead.
Final Dataset:¶
The final dataset, prepared for classification, contains transformed features such as model_encoded, transmission_encoded, fuel_type_encoded, and year_encoded, along with the binned target variable price_encoded and scaled numerical variables like mileage_scaled. The dataset is now structured with categorical and numerical variables, all in the correct format for use in a classification model.
Classification Approach:¶
Given that the target variable price has been transformed into discrete categories (e.g., Very Low, Low, Medium, High), the task is now a classification problem rather than a regression problem. The goal is to predict which price category a car will fall into based on its features. Feature selection and preprocessing steps ensure that all variables are in the right format and scale for optimal model performance. The model will be trained to classify a given car into one of the price categories based on its characteristics, which include model, transmission type, fuel type, mileage, and others. Overall, the preprocessing steps ensured that all variables were appropriately encoded, scaled, or transformed to provide the model with clean and structured data, ready for building and training a classification model.
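As a hedged sketch of the binning step described above, quantile binning with pandas might look like the following. The prices, bin count, and labels here are illustrative only; the lab's actual bin edges and label set may differ.

```python
import pandas as pd

# Illustrative prices only; not drawn from the lab's dataset.
prices = pd.Series([8000, 12000, 20000, 35000, 60000, 90000])

# Quantile-based binning into three ordered categories, then integer codes
# (analogous to producing a 'price_encoded' target from raw prices).
price_binned = pd.qcut(prices, q=3, labels=['Low', 'Medium', 'High'])
price_encoded = price_binned.cat.codes
print(list(price_encoded))  # each tercile receives two of the six cars
```

`pd.qcut` places roughly equal counts in each bin, which keeps the resulting classes balanced; fixed-edge `pd.cut` is the alternative when the category boundaries carry business meaning.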
1.2 Identifying Groups for Cross-Product Features¶
Proposed Cross-Product Features and Justification:¶
- `fuel_type_encoded` × `mpg_encoded`
  - Justification: The relationship between fuel type and MPG can provide useful insights into how different fuel types (e.g., Diesel, Petrol) correlate with fuel efficiency. For example, Diesel and Hybrid cars tend to have higher MPG than Petrol cars. By crossing these features, we capture interactions between fuel type and MPG that might not be apparent when the features are treated separately.
  - Mapped Values:
    - Fuel Type Encoding Mapping: `Diesel: 0`, `Hybrid: 1`, `Other: 2`, `Petrol: 3`
    - MPG Encoding Mapping: `Very Low: 0`, `Low: 1`, `Medium: 2`, `High: 3`, `Very High: 4`, `Excellent: 5`, `Top Tier: 6`
  - This cross-product feature can help identify patterns of high MPG for specific fuel types (e.g., `Petrol` cars with `Excellent` or `Very High` MPG ratings).
- `transmission_encoded` × `engine_size_scaled`
  - Justification: The interaction between transmission type and engine size can be a significant factor in a car's performance and efficiency. `Automatic` and `Semi-Auto` transmissions tend to be paired with larger engines, while `Manual` transmissions are often associated with smaller ones. Crossing these features allows the model to better understand how engine size relates to transmission type.
  - Mapped Values:
    - Transmission Encoding Mapping: `Automatic: 0`, `Manual: 1`, `Other: 2`, `Semi-Auto: 3`
    - Engine Size Scaling: the scaled `engine_size` expresses each engine's size relative to the dataset, so crossing it with `transmission` can reveal trends in car configurations (e.g., larger engines tend to be automatic).
  - This interaction captures how engine size and transmission type work together in shaping a vehicle's overall performance and efficiency.
- `year_encoded` × `mileage_scaled`
  - Justification: A car's age (represented by `year`) and its mileage are often related: older cars typically have higher mileage, and this relationship can inform depreciation and maintenance expectations. Crossing these features captures the interaction between a car's age and its condition (in terms of mileage), which can influence its pricing and desirability.
  - Mapped Values:
    - Year Encoding Mapping: `1990s: 0`, `2000s: 1`, `2010s: 2`, `2020s: 3`
    - Mileage Scaling: `mileage` is scaled to capture its relative effect on overall vehicle condition, with lower mileage indicating better condition; crossing it with the `year` of manufacture lets the model grasp how mileage patterns change over time.
  - This cross-product can reveal trends such as how high mileage penalizes older cars, or how newer cars with high mileage might still be considered in good condition.
- `model_encoded` × `year_encoded`
  - Justification: Different car models have different lifespans, and older models often differ in features and design from newer ones. By crossing `model` with `year`, we capture how a model's age-related characteristics, such as depreciation or technological advancements, vary.
  - Mapped Values:
    - Model Encoding Mapping: each `model` is mapped to an integer (e.g., `180` to 0, `200` to 1, etc.), distinguishing vehicle models numerically.
    - Year Encoding Mapping: a model's release `year` can interact with that model's specific features, highlighting how older models perform or are valued differently than newer ones.
  - This interaction could reveal patterns such as newer models (e.g., `GLA Class` or `S Class`) holding their value better than 1990s-era vehicles.
Why the Target Variable Should Not Be Included:¶
- Target Variable (e.g., `price_encoded`): The target variable in a classification or regression task represents the output the model aims to predict. It should not be included in the cross-product features because:
  - Leakage of Information: Including the target variable in the feature set would introduce data leakage, where the model already knows the outcome during training, leading to an unrealistically optimistic evaluation of its performance.
  - Redundancy: The target variable is what the model is trying to predict, so it should not be part of the input features. Including it would make the problem trivial and invalidate the prediction task.
  - Model Integrity: The objective is for the model to learn meaningful relationships between the features and the target variable. Including the target in the feature set would undermine this learning process by providing direct access to the target during training.
Conclusion:¶
The proposed cross-product features are meaningful because they combine variables that have logical interactions in the context of the dataset. These interactions could reveal complex patterns that would be missed if the features were used separately. Additionally, the encoded values ensure that the categorical features are handled in a way that captures the relationship between them, while the scaling of continuous features (like engine_size and mileage) ensures that their values are appropriately accounted for in the cross-products. However, the target variable should not be included as a feature to prevent data leakage and maintain the integrity of the prediction task.
from sklearn.preprocessing import LabelEncoder  # needed to encode the crossed string keys

# Cross Columns: encode each pair of columns as a single integer feature
cross_cols = [['fuel_type_encoded', 'mpg_encoded'],
['transmission_encoded', 'engine_size_scaled'],
['year_encoded', 'mileage_scaled'],
['model_encoded', 'year_encoded']]
cross_col_names = []
for cols_list in cross_cols:
enc = LabelEncoder()
X_crossed = df_preprocessed[cols_list].astype(str).apply(lambda x: '_'.join(x), axis=1)
cross_col_name = '_'.join(cols_list)
enc.fit(X_crossed)
df_preprocessed[cross_col_name] = enc.transform(X_crossed)
cross_col_names.append(cross_col_name)
cross_col_names
['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded']
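To make the crossing concrete, here is a minimal, self-contained sketch with made-up values showing how joining two encoded columns as strings and re-encoding the result yields one integer ID per unique pair; `pd.factorize` stands in for the `LabelEncoder` used in the cell above.

```python
import pandas as pd

# Toy encoded columns (values are illustrative, not from the lab's dataset)
df = pd.DataFrame({'fuel_type_encoded': [0, 0, 1, 3, 3],
                   'mpg_encoded':       [2, 2, 5, 1, 2]})

# Join each row's values into a string key, then map keys to integer IDs
crossed = df[['fuel_type_encoded', 'mpg_encoded']].astype(str).agg('_'.join, axis=1)
codes, uniques = pd.factorize(crossed)
print(list(codes), len(uniques))  # four distinct (fuel, mpg) pairs
```

Identical pairs share an ID, so the wide branch can memorize per-combination effects that neither column carries alone.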
1.3 Metrics for Evaluating Algorithm Performance¶
For evaluating the performance of the classification model on the price_encoded target variable, the chosen metrics are F1 Score, Precision, and Recall. These metrics align with the business objectives and address the needs of various stakeholders. Each metric provides a different perspective on model performance, ensuring that the model accurately classifies vehicles into price categories in a way that is balanced and relevant to operational needs.
F1 Score¶
The F1 Score is a metric that balances Precision and Recall, providing an overall measure of the model’s performance across different price categories. This metric is especially important for stakeholders such as sales and marketing teams, who rely on accurate vehicle segmentation to effectively target specific customer segments.
Business Relevance: Sales and marketing teams need an accurate breakdown of vehicle categories—such as budget, mid-range, and premium—to tailor promotions and strategies accordingly. A high F1 Score, ideally above 0.75, would indicate that the model can effectively differentiate across all price bins, minimizing risks associated with mis-targeting. An F1 Score close to 0.80 or higher would be particularly useful, as it shows the model is well-balanced and can identify various segments without favoring one too heavily.
Impact: With a balanced F1 Score, no customer segment is disproportionately ignored. This metric ensures efficient resource allocation across different categories, improving the reach and engagement of marketing campaigns.
Precision¶
Precision is essential for evaluating the model’s accuracy in predicting specific price categories. High Precision helps avoid false positives, particularly for high-value categories, which is critical for stakeholders such as inventory management and customer relations.
Business Relevance: Inventory and customer relations teams need the model to accurately identify high-value categories to ensure that customers are not misled by incorrect classifications of vehicles as premium when they are not. For premium bins, Precision should ideally be above 0.85 to avoid classifying lower-cost vehicles as high-value. For budget categories, a Precision score of 0.75 is acceptable, as minor overlaps may be tolerable due to higher demand.
Impact: Strong Precision (above 0.85 for high-value bins) builds customer trust, as it assures that vehicles advertised as premium meet expectations. Additionally, by correctly classifying these premium vehicles, the organization can allocate them to the appropriate customer segments, reducing resource misallocation.
Recall¶
Recall measures the model’s effectiveness in capturing all relevant instances within each price category, ensuring comprehensive coverage of each price range. This is valuable to market analysis and inventory planning teams, as it helps avoid missing any vehicles within high-demand segments.
Business Relevance: Market analysts and inventory planners benefit from high Recall because it enables accurate demand forecasting and better inventory management. For budget bins, Recall should ideally be above 0.80 to ensure the model captures a complete view of affordable options. For luxury bins, where segments are often smaller, a slightly lower Recall of 0.75 is acceptable.
Impact: High Recall across categories ensures full market visibility, helping analysts make confident assessments of demand across price segments. For inventory management, high Recall ensures that the inventory aligns well with demand across all categories, reducing the risk of stock imbalances.
Summary of Metrics and Stakeholder Impact¶
Each metric was chosen to align with business needs and maximize operational efficiency:
- F1 Score provides a balanced measure, helping sales and marketing reach the right audience segments with fewer misclassifications.
- Precision minimizes costly errors in premium categories, enhancing customer satisfaction and resource allocation.
- Recall supports complete market visibility and demand forecasting, critical for market analysis and inventory planning.
Together, these metrics provide a well-rounded evaluation of model performance, ensuring that the classification of price categories supports business objectives across multiple functional areas. By meeting each metric’s threshold, the model can drive data-driven decisions, improving customer engagement and operational accuracy.
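As a small, self-contained illustration of the support-weighted averaging these metrics use (toy labels only, not the lab's data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multi-class labels (illustrative only)
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

# 'weighted' averages the per-class scores by class support, matching the
# evaluation setting used later in this notebook.
p = precision_score(y_true, y_pred, average='weighted')
r = recall_score(y_true, y_pred, average='weighted')
f = f1_score(y_true, y_pred, average='weighted')
```

Weighted recall equals accuracy by construction, which is why the two track each other closely in the per-fold results reported below.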
1.4 Dividing Data into Training & Testing¶
Method for Dividing Data: Stratified 10-Fold Cross-Validation¶
For dividing the data into training and testing, Stratified 10-fold cross-validation will be used. This method was selected due to several reasons that align with the nature of the dataset and the task at hand.
Choice of Method: Stratified 10-Fold Cross-Validation¶
Stratified 10-fold cross-validation ensures that the data is split into 10 equal parts, with each fold maintaining the same distribution of the target variable, price_encoded, as in the entire dataset. This is particularly important because the target variable consists of multiple price categories (bins), which could potentially be imbalanced. Some price categories might have more data points than others, and using stratified splits ensures that each fold has a proportional representation of each class. This way, each fold accurately represents the overall distribution of the target, preventing bias that could arise from skewed distributions in certain folds.
Why Stratified 10-Fold Cross-Validation Is Appropriate¶
- Handling Imbalanced Classes: The target variable, `price_encoded`, consists of different price categories that may not have an even distribution of instances. Some price bins may be overrepresented (e.g., a popular mid-range category), while others have very few instances. A standard cross-validation split could then leave some folds with few or no examples of certain price bins, producing biased or inaccurate performance estimates. Stratified cross-validation addresses this by ensuring that each fold has a similar proportion of each category, making the evaluation more reliable across all price bins.
- More Reliable Performance Metrics: Since each fold is tested on a different subset of the data, performance metrics such as F1 score, precision, and recall are averaged over multiple folds. This reduces the impact of any one unrepresentative random split, accounts for variability in the model's performance, and yields a more robust estimate of how well the model generalizes to new, unseen data.
- Mirroring Real-World Use: In practice, machine learning models are deployed to handle new data on an ongoing basis. Stratified 10-fold cross-validation simulates this scenario by repeatedly training and testing the model on different subsets of the data, mirroring deployment, where the model must generalize across varied data points from different sources.
- Maximizing Data Use: Every data point is used for both training and testing across the folds, which is especially important when the dataset is limited. In contrast, a traditional 80/20 train-test split sets aside 20% of the data for testing, reducing the training data available and risking a less accurate performance evaluation.
- Balanced and Consistent Evaluation: Stratified cross-validation prevents situations where a single random train-test split fails to reflect the overall dataset, particularly under class imbalance, ensuring the model is evaluated consistently and fairly across all subsets of the data.
Conclusion¶
Stratified 10-fold cross-validation is the most appropriate method for splitting the data in this task. It ensures that the evaluation process is representative of the target variable's distribution, leading to more accurate performance assessments. This method also reflects how an algorithm would be used in real-world scenarios, where consistent and robust model evaluation is essential. By using Stratified 10-fold cross-validation, the performance metrics—such as F1 score, precision, and recall—will be calculated more reliably, providing a true reflection of the model’s ability to generalize to unseen data.
Splitting the Data with Stratified Fold¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Initializes StratifiedKFold with 10 folds
strat_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# Initializing the model
model = RandomForestClassifier(random_state=42)
# Preparing data for splits
X = df_preprocessed.drop(columns=['price_encoded'])
y = df_preprocessed['price_encoded']
# Initializing list to store the splits
splits = []
# Running cross-validation and split the data
for train_index, test_index in strat_kfold.split(X, y):
# Stores the split data
splits.append((X.iloc[train_index], X.iloc[test_index], y.iloc[train_index], y.iloc[test_index]))
# Checking for Successful Split
print(f'Train set shape: {X.iloc[train_index].shape}, Test set shape: {X.iloc[test_index].shape}')
print(f'Target distribution in training set:\n{y.iloc[train_index].value_counts(normalize=True)}')
print(f'Target distribution in test set:\n{y.iloc[test_index].value_counts(normalize=True)}')
Train set shape: (11115, 12), Test set shape: (1236, 12) Target distribution in training set: price_encoded 2 0.396761 1 0.383536 3 0.129285 4 0.049033 0 0.023212 5 0.018174 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396440 1 0.383495 3 0.129450 4 0.049353 0 0.023463 5 0.017799 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129273 4 0.049118 0 0.023210 5 0.018172 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.129555 4 0.048583 0 0.023482 5 0.017814 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129273 4 0.049118 0 0.023210 5 0.018172 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.129555 4 0.048583 0 0.023482 5 0.017814 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129273 4 0.049118 0 0.023210 5 0.018172 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.129555 4 0.048583 0 0.023482 5 0.017814 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383591 3 0.129273 4 0.049118 0 0.023210 5 0.018082 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.382996 3 0.129555 4 0.048583 0 0.023482 5 0.018623 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383591 3 0.129273 4 0.049028 0 0.023300 5 0.018082 Name: proportion, dtype: float64 Target distribution in test 
set: price_encoded 2 0.396761 1 0.382996 3 0.129555 4 0.049393 0 0.022672 5 0.018623 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383591 3 0.129273 4 0.049028 0 0.023300 5 0.018082 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.382996 3 0.129555 4 0.049393 0 0.022672 5 0.018623 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129363 4 0.049028 0 0.023300 5 0.018082 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.128745 4 0.049393 0 0.022672 5 0.018623 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129363 4 0.049028 0 0.023210 5 0.018172 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.128745 4 0.049393 0 0.023482 5 0.017814 Name: proportion, dtype: float64 Train set shape: (11116, 12), Test set shape: (1235, 12) Target distribution in training set: price_encoded 2 0.396725 1 0.383501 3 0.129363 4 0.049028 0 0.023210 5 0.018172 Name: proportion, dtype: float64 Target distribution in test set: price_encoded 2 0.396761 1 0.383806 3 0.128745 4 0.049393 0 0.023482 5 0.017814 Name: proportion, dtype: float64
The cross-validation process using StratifiedKFold has been successfully implemented. The data was split into 10 folds, with each fold containing training and test sets. The training sets consistently contain around 11,115 to 11,116 samples, and the test sets have 1,235 samples. The target variable (price_encoded) is well-stratified, with the distribution in both the training and test sets remaining almost identical across all folds. This ensures that the target classes are proportionally represented in each fold, which helps in evaluating the model’s performance accurately. The feature sets used for training and testing contain 12 columns, matching the expected number of features. Overall, the stratified splitting process appears to be functioning correctly, ensuring a reliable cross-validation setup.
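The proportional splits described above can be sanity-checked on a toy imbalanced dataset; the 80/20 class sizes here are illustrative, not the lab's actual proportions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 80 majority / 20 minority samples; with 10 stratified folds, each test
# fold of 10 samples should contain exactly 8 majority and 2 minority examples.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
minority_counts = [int((y[test_idx] == 1).sum()) for _, test_idx in skf.split(X, y)]
print(minority_counts)
```

A plain `KFold` on the same data can place zero minority samples in some folds, which is exactly the failure mode stratification prevents.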
# Now that splits are stored, we define a function to calculate metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def calculate_metrics(splits):
for fold, (X_train, X_test, y_train, y_test) in enumerate(splits, 1):
# Trains the model
model.fit(X_train, y_train)
# Predicts on the test set
y_pred = model.predict(X_test)
# Calculates and prints metrics
print(f'Metrics for fold {fold}:')
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred, average="weighted", zero_division=1)}')
print(f'Recall: {recall_score(y_test, y_pred, average="weighted", zero_division=1)}')
print(f'F1 Score: {f1_score(y_test, y_pred, average="weighted", zero_division=1)}')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}')
print('-' * 50)
# Calls the function to calculate metrics for each split
calculate_metrics(splits)
Metrics for fold 1: Accuracy: 0.7710355987055016 Precision: 0.7694336520357486 Recall: 0.7710355987055016 F1 Score: 0.7699011516873747 Confusion Matrix: [[ 15 14 0 0 0 0] [ 7 402 65 0 0 0] [ 0 60 386 43 1 0] [ 0 2 46 94 17 1] [ 0 0 1 16 39 5] [ 0 0 1 1 3 17]] -------------------------------------------------- Metrics for fold 2: Accuracy: 0.7740890688259109 Precision: 0.7718992985347581 Recall: 0.7740890688259109 F1 Score: 0.7722561123236487 Confusion Matrix: [[ 17 12 0 0 0 0] [ 10 407 56 1 0 0] [ 1 65 388 36 0 0] [ 0 0 51 95 13 1] [ 0 0 5 13 39 3] [ 0 0 0 2 10 10]] -------------------------------------------------- Metrics for fold 3: Accuracy: 0.7765182186234818 Precision: 0.7779068487216326 Recall: 0.7765182186234818 F1 Score: 0.7768415453892827 Confusion Matrix: [[ 20 8 1 0 0 0] [ 4 397 72 1 0 0] [ 0 55 393 40 2 0] [ 0 0 43 101 16 0] [ 0 0 0 15 36 9] [ 0 0 0 1 9 12]] -------------------------------------------------- Metrics for fold 4: Accuracy: 0.7724696356275303 Precision: 0.7736378753086787 Recall: 0.7724696356275303 F1 Score: 0.7724175059932403 Confusion Matrix: [[ 20 9 0 0 0 0] [ 5 407 62 0 0 0] [ 0 53 387 46 4 0] [ 0 0 53 91 16 0] [ 0 0 1 19 37 3] [ 0 0 0 2 8 12]] -------------------------------------------------- Metrics for fold 5: Accuracy: 0.7748987854251013 Precision: 0.7728328914277057 Recall: 0.7748987854251013 F1 Score: 0.7736129325153916 Confusion Matrix: [[ 18 11 0 0 0 0] [ 7 408 57 1 0 0] [ 0 69 385 33 3 0] [ 0 0 48 98 13 1] [ 0 0 0 18 34 8] [ 0 0 0 0 9 14]] -------------------------------------------------- Metrics for fold 6: Accuracy: 0.7716599190283401 Precision: 0.7725246670405967 Recall: 0.7716599190283401 F1 Score: 0.7715967298684159 Confusion Matrix: [[ 13 15 0 0 0 0] [ 8 400 65 0 0 0] [ 0 50 390 48 2 0] [ 0 0 44 100 15 1] [ 0 0 2 21 35 3] [ 0 0 0 0 8 15]] -------------------------------------------------- Metrics for fold 7: Accuracy: 0.7708502024291498 Precision: 0.76938559018603 Recall: 0.7708502024291498 F1 Score: 
0.7699713242142371 Confusion Matrix: [[ 19 9 0 0 0 0] [ 10 401 62 0 0 0] [ 0 68 385 35 2 0] [ 1 1 43 95 19 1] [ 0 0 2 19 34 6] [ 0 0 0 1 4 18]] -------------------------------------------------- Metrics for fold 8: Accuracy: 0.7829959514170041 Precision: 0.7824110918495563 Recall: 0.7829959514170041 F1 Score: 0.7812797258770162 Confusion Matrix: [[ 14 14 0 0 0 0] [ 7 420 47 0 0 0] [ 1 65 378 46 0 0] [ 0 0 42 107 8 2] [ 0 0 3 19 31 8] [ 0 0 0 0 6 17]] -------------------------------------------------- Metrics for fold 9: Accuracy: 0.7927125506072874 Precision: 0.7937257699343446 Recall: 0.7927125506072874 F1 Score: 0.7917911195637125 Confusion Matrix: [[ 16 13 0 0 0 0] [ 5 415 54 0 0 0] [ 0 63 387 38 2 0] [ 0 0 36 111 12 0] [ 0 0 1 19 39 2] [ 0 0 0 1 10 11]] -------------------------------------------------- Metrics for fold 10: Accuracy: 0.7659919028340081 Precision: 0.768035865316256 Recall: 0.7659919028340081 F1 Score: 0.7666693522412554 Confusion Matrix: [[ 17 12 0 0 0 0] [ 12 400 61 1 0 0] [ 0 58 380 51 1 0] [ 0 0 42 101 15 1] [ 0 0 3 23 32 3] [ 0 0 0 0 6 16]] --------------------------------------------------
The metrics above represent the performance evaluation across the 10 folds for this task.
Analysis¶
Accuracy, Precision, Recall, and F1 Score:
- Consistency: The accuracy, precision, recall, and F1 scores across the folds are generally consistent, with accuracy values between 0.77 and 0.79. The F1 score also stays close to this range, showing that the model performs steadily across different data subsets.
- High Recall and Precision: Precision and recall values closely match accuracy in each fold, which suggests a balance between sensitivity and specificity in the model’s predictions. Since the F1 score is the harmonic mean of precision and recall, its consistency indicates a good balance between false positives and false negatives.
Confusion Matrices:
- Diagonal Dominance: The confusion matrices mostly show values concentrated along the diagonal, meaning the model correctly classifies a significant portion of the samples across classes. However, misclassifications remain, particularly between adjacent mid-range classes, where off-diagonal counts reach as high as the 40-60 range.
- Class Imbalance: Some classes, especially in the middle rows, appear with higher counts, suggesting possible class imbalance. Misclassifications between adjacent classes (like class 2 being misclassified as class 3) suggest overlapping features that make these classes harder to distinguish.
- Class-Specific Performance: For smaller classes (e.g., class 5), performance varies, with fewer misclassifications for these classes. This could mean the model has learned specific features for certain classes but struggles more with others due to feature overlap or limited representation in the data.
Cross-Fold Variability:
- Folds 8 and 9 show slightly higher accuracy and F1 scores, which may mean the data in these folds is easier to classify or contains fewer ambiguous cases.
- Fold 10 has the lowest metrics, possibly because it contains more challenging or overlapping data points, making it harder for the model to classify accurately.
Model Reliability:
- Overall, the metrics across folds suggest that the model performs consistently and reliably. However, slight dips in some folds hint that the model could improve with more tuning, especially by addressing class imbalance or refining features that help distinguish between similar classes.
Recommendations:
- Address Class Imbalance: Oversample underrepresented classes or apply class weighting to give them more attention during training.
- Feature Engineering: Add features that help the model distinguish between similar classes, especially those with high misclassification rates.
- Hyperparameter Tuning: Adjusting hyperparameters could reduce variability across folds and enhance overall performance.
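The class-weighting recommendation can be sketched with scikit-learn's helper; the label proportions below are illustrative, loosely mirroring the skew seen in `price_encoded`, not the lab's exact counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative imbalanced labels (not the lab's exact proportions)
y = np.array([2] * 40 + [1] * 38 + [3] * 13 + [4] * 5 + [0] * 2 + [5] * 2)

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
class_weight = dict(zip(classes, weights))
# Rare classes receive larger weights; the dict can be passed to, e.g.,
# RandomForestClassifier(class_weight=...) or Keras model.fit(class_weight=...)
```

The 'balanced' scheme sets each weight to n_samples / (n_classes * class_count), so loss contributions from rare price bins are scaled up to match the majority bins.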
Summary¶
The model has stable performance across multiple folds, with consistent metrics and some potential for improvement in distinguishing between similar or imbalanced classes.
2. Modeling¶
Wide and Deep Networks and a baseline Multi-Layer Perceptron (MLP) are trained and evaluated for a classification task. The Wide and Deep architecture combines a wide branch for feature interactions with a deep branch for learning intricate patterns, while the MLP relies solely on deep layers. Models are trained with varying crossed columns and deep branch layer configurations, and performance is assessed using precision, recall, F1-scores, and AUC.
Stratified K-Fold Cross-Validation ensures robust evaluation. Results reveal that adding complexity, such as more crossed columns or layers, does not consistently improve performance and may introduce noise or overfitting. The Wide and Deep model demonstrates consistent but modest performance, while the MLP achieves higher mean AUC but with greater variability. Simplifying features, regularization, and alternative feature engineering are recommended for improvement.
2.1 Three Combined Wide & Deep Networks¶
# Function to build a combined wide and deep model with specified crossed columns
import numpy as np
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def build_combined_model(input_shape, crossed_columns):
# Wide branch using crossed columns
wide_input = Input(shape=(len(crossed_columns),))
wide_output = Dense(6, activation='softmax')(wide_input)
# Deep branch with standard feature columns
deep_input = Input(shape=(input_shape,))
x = Dense(64, activation='relu')(deep_input)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
deep_output = Dense(6, activation='softmax')(x)
# Merges wide and deep branches
merged = concatenate([wide_output, deep_output])
final_output = Dense(6, activation='softmax')(merged)
model = Model(inputs=[wide_input, deep_input], outputs=final_output)
model.compile(
optimizer=Adam(),
loss='sparse_categorical_crossentropy',
metrics=['accuracy'] # We’ll calculate precision, recall, F1 outside the model
)
return model
# Data Preparation
X = df_preprocessed.drop(columns=['price_encoded'])
y = df_preprocessed['price_encoded']

# Defining different combinations of crossed columns
crossed_columns_combinations = [
    ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'],  # Model 1: Two crossed columns
    ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'],  # Model 2: Three crossed columns
    cross_col_names  # Model 3: All crossed columns
]

# Cross-validation setup
strat_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Stores histories and metrics for each combined model
history_combined_models = []
metrics_summary = {f'Combined Model {i+1}': [] for i in range(len(crossed_columns_combinations))}

# Trains each combined model with different crossed columns
for model_idx, crossed_columns in enumerate(crossed_columns_combinations):
    print(f"\nTraining Combined Model {model_idx+1} with crossed columns: {crossed_columns}")

    # Prepares wide input data for the selected crossed columns
    X_wide = df_preprocessed[crossed_columns].values
    X_deep = X.values  # Deep input (all other features)

    # Lists to store histories and metrics for each fold of the current model
    history_combined = []
    fold_metrics = []

    for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
        X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
        X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
        y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values

        # Initializes the combined model with the current set of crossed columns
        combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns)

        # Early stopping to avoid overfitting
        early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

        # Trains the combined model
        print(f"Training fold {fold_idx + 1} for Combined Model {model_idx + 1}")
        history = combined_model.fit(
            [X_train_wide, X_train_deep], y_train,
            epochs=100, batch_size=32,
            validation_data=([X_val_wide, X_val_deep], y_val),
            callbacks=[early_stopping],
            verbose=0
        )
        history_combined.append(history)

        # Makes predictions and calculates precision, recall, and F1-score on the validation set
        y_val_pred = np.argmax(combined_model.predict([X_val_wide, X_val_deep]), axis=1)
        precision = precision_score(y_val, y_val_pred, average='weighted')
        recall = recall_score(y_val, y_val_pred, average='weighted')
        f1 = f1_score(y_val, y_val_pred, average='weighted')

        # Stores the metrics for this fold
        fold_metrics.append({
            'Fold': fold_idx + 1,
            'Precision': precision,
            'Recall': recall,
            'F1-Score': f1
        })

    # Appends fold metrics and model history for visualization and reporting
    metrics_summary[f'Combined Model {model_idx+1}'] = fold_metrics
    history_combined_models.append(history_combined)

# Prints summary of Precision, Recall, and F1-Score for each model
for model_name, folds in metrics_summary.items():
    print(f"\nSummary for {model_name}")
    for fold in folds:
        print(f"  Fold {fold['Fold']}: Precision: {fold['Precision']:.4f}, Recall: {fold['Recall']:.4f}, F1-Score: {fold['F1-Score']:.4f}")

    avg_precision = np.mean([f['Precision'] for f in folds])
    avg_recall = np.mean([f['Recall'] for f in folds])
    avg_f1 = np.mean([f['F1-Score'] for f in folds])

    print(f"\nOverall Performance for {model_name}:")
    print(f"  Average Precision: {avg_precision:.4f}")
    print(f"  Average Recall: {avg_recall:.4f}")
    print(f"  Average F1-Score: {avg_f1:.4f}")
    print("--------------------------------------------------")
# Visualization function for each model's training history
def plot_history(history_list, model_name):
    # Defines the grid of subplots (5 rows and 2 columns, one per fold)
    fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 20))
    fig.suptitle(f'{model_name} Training and Validation Performance Across Folds', fontsize=16)

    for i, history in enumerate(history_list):
        row, col = divmod(i, 2)  # Gets the row and column index for the subplot
        ax = axes[row, col]

        # Plots Accuracy
        ax.plot(history.history['accuracy'], label='Train Accuracy')
        ax.plot(history.history['val_accuracy'], label='Validation Accuracy')
        ax.set_title(f'Fold {i + 1}')
        ax.set_xlabel('Epochs')
        ax.set_ylabel('Accuracy')
        ax.legend(loc='upper left')

        # Plots Loss on a secondary y-axis
        ax2 = ax.twinx()
        ax2.plot(history.history['loss'], label='Train Loss', linestyle='--', color='tab:blue')
        ax2.plot(history.history['val_loss'], label='Validation Loss', linestyle='--', color='tab:orange')
        ax2.set_ylabel('Loss')
        ax2.legend(loc='upper right')

    plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjusts layout to make room for the main title
    plt.show()

# Plots training histories for each combined model
for model_idx, history in enumerate(history_combined_models):
    plot_history(history, f'Combined Model {model_idx + 1}')
[Per-fold Keras training progress output omitted; each of the three combined models was trained across 10 stratified folds.]

Summary for Combined Model 1
  Fold 1: Precision: 0.4810, Recall: 0.4498, F1-Score: 0.3682
  Fold 2: Precision: 0.4503, Recall: 0.4615, F1-Score: 0.3884
  Fold 3: Precision: 0.4918, Recall: 0.5142, F1-Score: 0.4888
  Fold 4: Precision: 0.4322, Recall: 0.4202, F1-Score: 0.3248
  Fold 5: Precision: 0.3827, Recall: 0.4421, F1-Score: 0.3615
  Fold 6: Precision: 0.4139, Recall: 0.4518, F1-Score: 0.3884
  Fold 7: Precision: 0.4232, Recall: 0.4567, F1-Score: 0.3890
  Fold 8: Precision: 0.4854, Recall: 0.5093, F1-Score: 0.4708
  Fold 9: Precision: 0.4833, Recall: 0.4761, F1-Score: 0.4546
  Fold 10: Precision: 0.4487, Recall: 0.4955, F1-Score: 0.4660

Overall Performance for Combined Model 1:
  Average Precision: 0.4492
  Average Recall: 0.4677
  Average F1-Score: 0.4101
--------------------------------------------------
Summary for Combined Model 2
  Fold 1: Precision: 0.1815, Recall: 0.4021, F1-Score: 0.2351
  Fold 2: Precision: 0.3852, Recall: 0.4219, F1-Score: 0.3322
  Fold 3: Precision: 0.3394, Recall: 0.3895, F1-Score: 0.3020
  Fold 4: Precision: 0.3734, Recall: 0.4008, F1-Score: 0.2344
  Fold 5: Precision: 0.3496, Recall: 0.4040, F1-Score: 0.3319
  Fold 6: Precision: 0.4449, Recall: 0.3984, F1-Score: 0.2303
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.1805, Recall: 0.3992, F1-Score: 0.2302
  Fold 9: Precision: 0.3737, Recall: 0.4016, F1-Score: 0.2348
  Fold 10: Precision: 0.4213, Recall: 0.4211, F1-Score: 0.3063

Overall Performance for Combined Model 2:
  Average Precision: 0.3230
  Average Recall: 0.4039
  Average F1-Score: 0.2670
--------------------------------------------------
Summary for Combined Model 3
  Fold 1: Precision: 0.4027, Recall: 0.4652, F1-Score: 0.4085
  Fold 2: Precision: 0.4680, Recall: 0.4065, F1-Score: 0.2481
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1810, Recall: 0.3984, F1-Score: 0.2320
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.3467, Recall: 0.4057, F1-Score: 0.3185
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.4685, Recall: 0.4008, F1-Score: 0.2351
  Fold 9: Precision: 0.3422, Recall: 0.3919, F1-Score: 0.3082
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Combined Model 3:
  Average Precision: 0.2934
  Average Recall: 0.4070
  Average F1-Score: 0.2680
--------------------------------------------------
Summary of Metrics¶
This summary presents a critical analysis of the training results for three combined models, each trained with various crossed feature columns to improve predictive performance. Here's an in-depth look at the outcomes:
Model Architecture and Feature Combinations¶
- Combined Model 1: Uses two crossed columns: 'fuel_type_encoded_mpg_encoded' and 'transmission_encoded_engine_size_scaled'.
- Combined Model 2: Uses three crossed columns: 'fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', and 'model_encoded_year_encoded', swapping the transmission cross for two year-based crosses and increasing complexity.
- Combined Model 3: Incorporates all four crossed columns, combining the sets used by Models 1 and 2.
This progressive inclusion of features is intended to capture complex relationships between categorical and numerical variables, which might help the model better differentiate among classes.
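The crossed columns referenced above (names like 'fuel_type_encoded_mpg_encoded') can be built by pairing the two parent encodings and label-encoding the resulting pairs. A minimal sketch on a toy frame (`df_toy` is illustrative; the lab's real columns and `cross_col_names` are created during preprocessing earlier in the notebook):

```python
# Hedged sketch of cross-product feature creation on a toy frame; the lab's
# actual columns live in df_preprocessed, built earlier in the notebook.
import pandas as pd

df_toy = pd.DataFrame({
    'fuel_type_encoded': [0, 1, 0, 2],
    'mpg_encoded':       [1, 1, 0, 2],
})

# Cross two encoded columns: pair up their values, then label-encode the pairs
pairs = df_toy['fuel_type_encoded'].astype(str) + '_' + df_toy['mpg_encoded'].astype(str)
df_toy['fuel_type_encoded_mpg_encoded'] = pairs.astype('category').cat.codes

print(df_toy['fuel_type_encoded_mpg_encoded'].tolist())  # → [1, 2, 0, 3]
```

Each distinct (fuel type, mpg bucket) pair gets its own integer code, so the wide branch can memorize interactions that neither parent column expresses alone.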
Model Performance Analysis¶
Combined Model 1
- Average Precision: 0.4492
- Average Recall: 0.4677
- Average F1-Score: 0.4101
- Model 1 achieves the highest F1-score among the three models, although still relatively low, suggesting that while it captures some patterns, it struggles to balance precision and recall.
- Fold Variability: There is considerable variation in F1-scores across folds, ranging from about 0.32 to 0.49, indicating sensitivity to specific data splits and possible issues with generalizability.
Combined Model 2
- Average Precision: 0.3230
- Average Recall: 0.4039
- Average F1-Score: 0.2670
- Performance drops significantly, especially in F1-score. The substituted crossed features do not improve upon Model 1’s performance and may introduce noise.
- Poor F1-Scores in Most Folds: Many folds have F1-scores below 0.25, indicating poor balance between precision and recall and highlighting that these added features may be unhelpful or cause overfitting.
Combined Model 3
- Average Precision: 0.2934
- Average Recall: 0.4070
- Average F1-Score: 0.2680
- Model 3 shows no improvement over Model 2, with a similarly low average F1-score. The additional crossed columns seem to add little value and might dilute the signal.
- Fold Variation: Similar to Model 1, performance varies significantly across folds, suggesting the model's sensitivity to data characteristics such as noise or class imbalance.
General Observations¶
- Inconsistent Results Across Folds: All models display significant variability in performance across folds, suggesting challenges with generalization. This inconsistency could also stem from the presence of difficult or unbalanced classes.
- Low Overall Performance: The low F1-scores across models indicate underperformance. Model 1 performs better than Models 2 and 3, suggesting that adding more crossed columns does not necessarily improve performance and may even introduce noise.
- Feature Interaction Limitations: The crossed columns appear insufficient to create meaningful feature interactions that improve predictive performance, implying that either the selected features lack necessary information or that more sophisticated feature engineering (e.g., polynomial interactions or embeddings) might be required.
- Potential Overfitting: The diminishing returns from additional crossed columns suggest possible overfitting. Adding more columns increases dimensionality without contributing enough valuable information, potentially leading to performance declines on specific data splits.
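Many of the weaker folds above show weighted precision near 0.18 with recall near 0.40, which is the signature of a model collapsing to the majority class: with majority prevalence p, a constant majority prediction yields weighted recall p and weighted precision p². A small synthetic check under an assumed ~40% majority class (the dataset's true class distribution may differ):

```python
# Sanity check (synthetic labels, illustrative ~40% majority class): a model
# that always predicts the majority class reproduces the recurring fold
# pattern of weighted precision ≈ prevalence² and weighted recall ≈ prevalence.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true_demo = rng.choice(6, size=1000, p=[0.40, 0.20, 0.15, 0.10, 0.10, 0.05])
y_pred_demo = np.zeros_like(y_true_demo)  # degenerate model: always class 0

p = precision_score(y_true_demo, y_pred_demo, average='weighted', zero_division=0)
r = recall_score(y_true_demo, y_pred_demo, average='weighted', zero_division=0)
f = f1_score(y_true_demo, y_pred_demo, average='weighted', zero_division=0)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```

The resulting values land close to the 0.18/0.40/0.23 pattern seen in the degenerate folds, reinforcing the class-imbalance concern above.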
Recommendations¶
- Simplify the Feature Set: Given Model 1’s relative success, focusing on simpler feature interactions may be beneficial. Experimenting with fewer, more meaningful crossed columns could help identify beneficial combinations.
- Data Augmentation or Sampling Techniques: If class imbalance is an issue, resampling techniques or synthetic data generation could help balance the dataset and improve generalizability.
- Regularization: Applying regularization techniques (e.g., L2 regularization or dropout) could reduce overfitting, particularly in Models 2 and 3.
- Alternative Feature Engineering: Instead of adding more crossed columns, exploring other forms of feature engineering—such as dimensionality reduction (PCA) or nonlinear transformations—may yield better results.
- Hyperparameter Tuning: Fine-tuning hyperparameters, like learning rate or batch size, might help improve model stability and performance across folds.
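As a concrete sketch of the regularization recommendation, an L2 penalty and dropout could be added to each layer of the deep branch. This is an illustrative variant of the Model 1 architecture, not the lab's final configuration; the function name `build_regularized_model` and the `l2_weight`/`drop_rate` defaults are assumptions.

```python
# Hedged sketch: wide & deep builder with L2 weight decay and dropout in the
# deep branch. Layer sizes mirror Model 1's [64, 128, 64]; hyperparameters
# here are illustrative defaults, not tuned values.
from tensorflow.keras.layers import Input, Dense, Dropout, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

def build_regularized_model(input_shape, n_crossed, l2_weight=1e-4, drop_rate=0.3):
    wide_input = Input(shape=(n_crossed,))
    wide_output = Dense(6, activation='softmax')(wide_input)

    deep_input = Input(shape=(input_shape,))
    x = deep_input
    for units in [64, 128, 64]:
        # L2 shrinks weights toward zero; dropout randomly silences units
        x = Dense(units, activation='relu', kernel_regularizer=l2(l2_weight))(x)
        x = Dropout(drop_rate)(x)
    deep_output = Dense(6, activation='softmax')(x)

    merged = concatenate([wide_output, deep_output])
    final_output = Dense(6, activation='softmax')(merged)

    model = Model(inputs=[wide_input, deep_input], outputs=final_output)
    model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

reg_model = build_regularized_model(input_shape=20, n_crossed=2)
reg_model.summary()
```

Dropout and L2 directly target the fold-to-fold variability noted above, since both discourage the deep branch from memorizing split-specific noise.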
Summary¶
While Combined Model 1 shows some promise, additional feature complexity in Models 2 and 3 does not yield improvements and may contribute to overfitting. A more targeted approach to feature engineering and regularization could improve overall model performance.
2.2 Generalization Performance¶
# Function to build a combined wide and deep model with a variable number of layers in the deep branch
def build_combined_model(input_shape, crossed_columns, deep_layers=[64, 128, 64]):
    # Wide branch using crossed columns (one input per crossed column)
    wide_input = Input(shape=(len(crossed_columns),))
    wide_output = Dense(6, activation='softmax')(wide_input)

    # Deep branch with the specified layer configuration
    deep_input = Input(shape=(input_shape,))
    x = deep_input
    for units in deep_layers:
        x = Dense(units, activation='relu')(x)
    deep_output = Dense(6, activation='softmax')(x)

    # Merges wide and deep branches
    merged = concatenate([wide_output, deep_output])
    final_output = Dense(6, activation='softmax')(merged)

    model = Model(inputs=[wide_input, deep_input], outputs=final_output)
    model.compile(
        optimizer=Adam(),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Defining different combinations of crossed columns
crossed_columns_combinations = [
    ['fuel_type_encoded_mpg_encoded', 'transmission_encoded_engine_size_scaled'],  # Two crossed columns
    ['fuel_type_encoded_mpg_encoded', 'year_encoded_mileage_scaled', 'model_encoded_year_encoded'],  # Three crossed columns
    cross_col_names  # All crossed columns
]

# Defines the models to compare with different layer configurations in the deep branch
deep_layer_configs = [
    [64, 128, 64],               # Three layers
    [64, 128, 64, 32],           # Four layers
    [64, 128, 64, 32, 16],       # Five layers
    [64, 128, 64, 32, 16, 8, 4]  # Seven layers
]

# Initializes a dictionary to store cross-validation results for each model configuration
cv_metrics_summary = {}

# Iterates through each combination of crossed columns and deep layer configurations
for col_combination_idx, crossed_columns in enumerate(crossed_columns_combinations):
    # Rebuilds the wide input for the current crossed columns (the deep input uses all features)
    X_wide = df_preprocessed[crossed_columns].values
    X_deep = X.values

    for layer_config_idx, deep_layers in enumerate(deep_layer_configs):
        model_name = f"Model with {len(crossed_columns)} crossed columns and {len(deep_layers)} layers"
        print(f"\nTraining and Evaluating: {model_name} with crossed columns: {crossed_columns} and deep layers: {deep_layers}")

        fold_metrics = []  # Stores metrics for each fold of this model configuration

        for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
            X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
            X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
            y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values

            # Builds the combined model for the current deep layer configuration and crossed columns
            combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns, deep_layers=deep_layers)

            # Early stopping to avoid overfitting
            early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

            # Trains the model
            history = combined_model.fit(
                [X_train_wide, X_train_deep], y_train,
                epochs=100, batch_size=32,
                validation_data=([X_val_wide, X_val_deep], y_val),
                callbacks=[early_stopping],
                verbose=0
            )

            # Calculates evaluation metrics on the validation set
            y_val_pred = np.argmax(combined_model.predict([X_val_wide, X_val_deep]), axis=1)
            precision = precision_score(y_val, y_val_pred, average='weighted')
            recall = recall_score(y_val, y_val_pred, average='weighted')
            f1 = f1_score(y_val, y_val_pred, average='weighted')

            fold_metrics.append({
                'Fold': fold_idx + 1,
                'Precision': precision,
                'Recall': recall,
                'F1-Score': f1
            })

        # Stores metrics for each fold of the current model
        cv_metrics_summary[model_name] = fold_metrics

# After training all models, prints out the summary for each combination
for model_name, folds in cv_metrics_summary.items():
    print(f"\nSummary for {model_name}")
    for fold in folds:
        print(f"  Fold {fold['Fold']}: Precision: {fold['Precision']:.4f}, Recall: {fold['Recall']:.4f}, F1-Score: {fold['F1-Score']:.4f}")

    avg_precision = np.mean([f['Precision'] for f in folds])
    avg_recall = np.mean([f['Recall'] for f in folds])
    avg_f1 = np.mean([f['F1-Score'] for f in folds])

    print(f"\nOverall Performance for {model_name}:")
    print(f"  Average Precision: {avg_precision:.4f}")
    print(f"  Average Recall: {avg_recall:.4f}")
    print(f"  Average F1-Score: {avg_f1:.4f}")
    print("--------------------------------------------------")
[Per-fold Keras training progress output omitted; each crossed-column set was trained with the four deep-layer configurations across 10 stratified folds.]

Summary for Model with 2 crossed columns and 3 layers
  Fold 1: Precision: 0.1815, Recall: 0.4021, F1-Score: 0.2351
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.3022, Recall: 0.3563, F1-Score: 0.2921
  Fold 4: Precision: 0.3347, Recall: 0.4000, F1-Score: 0.2354
  Fold 5: Precision: 0.3727, Recall: 0.3992, F1-Score: 0.2316
  Fold 6: Precision: 0.1572, Recall: 0.3960, F1-Score: 0.2251
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.3858, Recall: 0.4154, F1-Score: 0.3112
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 3 layers:
  Average Precision: 0.2834
  Average Recall: 0.3974
  Average F1-Score: 0.2465
--------------------------------------------------
Summary for Model with 2 crossed columns and 4 layers
  Fold 1: Precision: 0.1817, Recall: 0.4029, F1-Score: 0.2363
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.1814, Recall: 0.3992, F1-Score: 0.2303
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.1805, Recall: 0.4000, F1-Score: 0.2326
  Fold 8: Precision: 0.1805, Recall: 0.3992, F1-Score: 0.2302
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.4352, Recall: 0.4243, F1-Score: 0.3055

Overall Performance for Model with 2 crossed columns and 4 layers:
  Average Precision: 0.2803
  Average Recall: 0.4027
  Average F1-Score: 0.2396
--------------------------------------------------
Summary for Model with 2 crossed columns and 5 layers
  Fold 1: Precision: 0.3502, Recall: 0.3989, F1-Score: 0.3115
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.3942, Recall: 0.4202, F1-Score: 0.3268
  Fold 5: Precision: 0.3903, Recall: 0.4008, F1-Score: 0.2393
  Fold 6: Precision: 0.3494, Recall: 0.3984, F1-Score: 0.2320
  Fold 7: Precision: 0.1809, Recall: 0.4016, F1-Score: 0.2342
  Fold 8: Precision: 0.5008, Recall: 0.4024, F1-Score: 0.2385
  Fold 9: Precision: 0.3551, Recall: 0.4016, F1-Score: 0.2957
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 5 layers:
  Average Precision: 0.3442
  Average Recall: 0.4029
  Average F1-Score: 0.2582
--------------------------------------------------
Summary for Model with 2 crossed columns and 7 layers
  Fold 1: Precision: 0.4297, Recall: 0.4102, F1-Score: 0.2650
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.3728, Recall: 0.4000, F1-Score: 0.2330
  Fold 6: Precision: 0.3489, Recall: 0.3976, F1-Score: 0.2328
  Fold 7: Precision: 0.3313, Recall: 0.3846, F1-Score: 0.3016
  Fold 8: Precision: 0.5645, Recall: 0.4008, F1-Score: 0.2337
  Fold 9: Precision: 0.4428, Recall: 0.4162, F1-Score: 0.2771
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 2 crossed columns and 7 layers:
  Average Precision: 0.3207
  Average Recall: 0.4013
  Average F1-Score: 0.2475
--------------------------------------------------
Summary for Model with 3 crossed columns and 3 layers
  Fold 1: Precision: 0.1572, Recall: 0.3964, F1-Score: 0.2251
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.3446, Recall: 0.3927, F1-Score: 0.3061
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.3847, Recall: 0.4138, F1-Score: 0.3188
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.3085, Recall: 0.4008, F1-Score: 0.2345
  Fold 8: Precision: 0.3559, Recall: 0.4316, F1-Score: 0.3200
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 3 layers:
  Average Precision: 0.3578
  Average Recall: 0.4040
  Average F1-Score: 0.2572
--------------------------------------------------
Summary for Model with 3 crossed columns and 4 layers
  Fold 1: Precision: 0.3406, Recall: 0.3924, F1-Score: 0.3134
  Fold 2: Precision: 0.4013, Recall: 0.4259, F1-Score: 0.3276
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1810, Recall: 0.3984, F1-Score: 0.2320
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.1702, Recall: 0.3854, F1-Score: 0.2162
  Fold 9: Precision: 0.4699, Recall: 0.4032, F1-Score: 0.2383
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 4 layers:
  Average Precision: 0.2829
  Average Recall: 0.4005
  Average F1-Score: 0.2485
--------------------------------------------------
Summary for Model with 3 crossed columns and 5 layers
  Fold 1: Precision: 0.5653, Recall: 0.4037, F1-Score: 0.2386
  Fold 2: Precision: 0.5561, Recall: 0.4032, F1-Score: 0.2385
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.3698, Recall: 0.3895, F1-Score: 0.2323
  Fold 5: Precision: 0.3192, Recall: 0.3984, F1-Score: 0.2376
  Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.5644, Recall: 0.4000, F1-Score: 0.2319
  Fold 9: Precision: 0.3699, Recall: 0.4016, F1-Score: 0.2348
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 3 crossed columns and 5 layers:
  Average Precision: 0.3829
  Average Recall: 0.3997
  Average F1-Score: 0.2341
--------------------------------------------------
Summary for Model with 3 crossed columns and 7 layers
  Fold 1: Precision: 0.3603, Recall: 0.4029, F1-Score: 0.3121
  Fold 2: Precision: 0.1717, Recall: 0.4000, F1-Score: 0.2315
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1714, Recall: 0.3879, F1-Score: 0.2204
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.3567, Recall: 0.3992, F1-Score: 0.3136
  Fold 8: Precision: 0.5277, Recall: 0.5417, F1-Score: 0.5201
  Fold 9: Precision: 0.4378, Recall: 0.4024, F1-Score: 0.2366
  Fold 10: Precision: 0.3884, Recall: 0.4235, F1-Score: 0.3378

Overall Performance for Model with 3 crossed columns and 7 layers:
  Average Precision: 0.2934
  Average Recall: 0.4154
  Average F1-Score: 0.2859
--------------------------------------------------
Summary for Model with 4 crossed columns and 3 layers
  Fold 1: Precision: 0.3312, Recall: 0.3867, F1-Score: 0.3055
  Fold 2: Precision: 0.5580, Recall: 0.4024, F1-Score: 0.2368
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.4449, Recall: 0.3984, F1-Score: 0.2303
  Fold 7: Precision: 0.3880, Recall: 0.4057, F1-Score: 0.2613
  Fold 8: Precision: 0.4687, Recall: 0.4008, F1-Score: 0.2340
  Fold 9: Precision: 0.4378, Recall: 0.4024, F1-Score: 0.2366
  Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355

Overall Performance for Model with 4 crossed columns and 3 layers:
  Average Precision: 0.3355
  Average Recall: 0.3999
  Average F1-Score: 0.2435
--------------------------------------------------
Summary for Model with 4 crossed columns and 4 layers
  Fold 1: Precision: 0.3769, Recall: 0.4118, F1-Score: 0.3097
  Fold 2: Precision: 0.4708, Recall: 0.4040, F1-Score: 0.2429
  Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316
  Fold 4: Precision: 0.2770, Recall: 0.3992, F1-Score: 0.2337
  Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302
  Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254
  Fold 7: Precision: 0.1807, Recall: 0.4008, F1-Score: 0.2329
  Fold 8: Precision: 0.5645, Recall: 0.4008, F1-Score: 0.2337
  Fold 9: Precision: 0.5655, Recall: 0.4016, F1-Score: 0.2347
  Fold 10: Precision: 0.3828, Recall: 0.4186, F1-Score: 0.3359

Overall Performance for Model with 4 crossed columns and 4 layers:
  Average Precision: 0.3338
  Average Recall: 0.4033
  Average F1-Score: 0.2511
--------------------------------------------------
Summary for
Model with 4 crossed columns and 5 layers Fold 1: Precision: 0.2061, Recall: 0.3972, F1-Score: 0.2277 Fold 2: Precision: 0.5561, Recall: 0.4032, F1-Score: 0.2385 Fold 3: Precision: 0.1814, Recall: 0.4000, F1-Score: 0.2316 Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330 Fold 5: Precision: 0.1813, Recall: 0.3992, F1-Score: 0.2302 Fold 6: Precision: 0.1574, Recall: 0.3968, F1-Score: 0.2254 Fold 7: Precision: 0.3739, Recall: 0.4097, F1-Score: 0.3086 Fold 8: Precision: 0.5644, Recall: 0.4000, F1-Score: 0.2319 Fold 9: Precision: 0.3423, Recall: 0.4049, F1-Score: 0.2858 Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355 Overall Performance for Model with 4 crossed columns and 5 layers: Average Precision: 0.2926 Average Recall: 0.4014 Average F1-Score: 0.2448 -------------------------------------------------- Summary for Model with 4 crossed columns and 7 layers Fold 1: Precision: 0.5653, Recall: 0.4037, F1-Score: 0.2380 Fold 2: Precision: 0.5579, Recall: 0.4016, F1-Score: 0.2350 Fold 3: Precision: 0.3651, Recall: 0.3992, F1-Score: 0.3063 Fold 4: Precision: 0.1815, Recall: 0.4008, F1-Score: 0.2330 Fold 5: Precision: 0.4432, Recall: 0.5385, F1-Score: 0.4752 Fold 6: Precision: 0.5405, Recall: 0.3976, F1-Score: 0.2272 Fold 7: Precision: 0.1803, Recall: 0.3992, F1-Score: 0.2323 Fold 8: Precision: 0.3745, Recall: 0.4073, F1-Score: 0.3140 Fold 9: Precision: 0.5657, Recall: 0.4032, F1-Score: 0.2382 Fold 10: Precision: 0.1818, Recall: 0.4024, F1-Score: 0.2355 Overall Performance for Model with 4 crossed columns and 7 layers: Average Precision: 0.3956 Average Recall: 0.4154 Average F1-Score: 0.2735 --------------------------------------------------
Analysis of Neural Network Configurations with Varying Crossed Columns and Layer Depths¶
This analysis explores different neural network configurations with varying crossed columns and hidden layer depths, evaluating performance using precision, recall, and F1-score. Key findings are outlined below:
Crossed Columns Impact
- Models tested combinations of two, three, and four crossed columns to capture interactions between features such as `fuel_type_encoded`, `transmission_encoded`, and `year_encoded`.
- Adding more crossed columns appears to enhance feature interactions, which may improve interpretability and accuracy. However, the benefit of additional crossed columns may be limited by dataset complexity and the nature of the relationships between features.
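As a rough sketch of the crossed-column idea, the two encoded categories below (a hypothetical toy frame, not the actual Mercedes-Benz data) are combined into a single cross-product identifier that the wide branch can memorize directly:

```python
import pandas as pd

# Toy frame standing in for two encoded categorical features (values are hypothetical).
df = pd.DataFrame({
    "fuel_type_encoded": [0, 1, 0, 2],
    "transmission_encoded": [1, 1, 0, 0],
})

# A crossed column is the Cartesian product of two categorical codes:
# each unique (fuel, transmission) pair gets its own identifier.
n_transmission = df["transmission_encoded"].nunique()  # cardinality of the second column
df["fuel_x_transmission"] = (
    df["fuel_type_encoded"] * n_transmission + df["transmission_encoded"]
)
print(df["fuel_x_transmission"].tolist())
```

Distinct category pairs map to distinct identifiers, so the wide branch can learn a separate weight per combination rather than per individual feature.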
Layer Depth and Structure
- Layer depths varied across 3, 4, 5, and 7 hidden layers, with configurations such as `[64, 128, 64]` and `[64, 128, 64, 32]`.
- Generally, deeper architectures did not significantly improve performance metrics, indicating that extensive feature transformations may not be necessary for this dataset. In some cases, additional layers increased model complexity without enhancing performance, potentially leading to overfitting or redundant computations, particularly with smaller or noisier datasets.
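To see why extra layers add complexity without guaranteeing better metrics, a small helper (hypothetical, not part of the lab code) can count the trainable parameters a dense stack introduces; assuming a 10-feature input, extending `[64, 128, 64]` with one 32-unit layer adds roughly 2,000 parameters:

```python
def dense_param_count(input_dim, layer_sizes):
    """Number of trainable parameters (weights + biases) in a stack of dense layers."""
    total, fan_in = 0, input_dim
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units                   # this layer's width feeds the next layer
    return total

shallow = dense_param_count(10, [64, 128, 64])       # three-layer configuration
deeper = dense_param_count(10, [64, 128, 64, 32])    # four-layer configuration
print(shallow, deeper)  # 17280 19360
```

Every added layer contributes `fan_in * units + units` parameters that must be fit from the same data, which is one way depth can hurt on smaller or noisier datasets.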
Precision, Recall, and F1-Score Observations
- Precision: Precision across models remained generally low, with no model achieving high precision. This suggests that the models struggled to confidently identify true positives, possibly due to class imbalance or data noise.
- Recall: Average recall was around 0.4, indicating a moderate ability to detect positive instances.
- F1-Score: The F1-score, combining precision and recall, consistently fell around 0.25–0.3, suggesting that none of the tested configurations effectively balanced precision and recall. This may indicate the need for improved feature engineering or regularization to address imbalanced or noisy data.
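The low-precision, moderate-recall pattern is what macro-averaged metrics produce under class imbalance. A minimal scikit-learn sketch (toy labels, not the lab's data) shows a majority-class predictor scoring macro recall 0.5 but macro precision only 0.4:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions on an imbalanced slice: the model predicts
# the majority class (0) for every sample.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
# Class 0: precision 0.8, recall 1.0; class 1: precision 0, recall 0.
# Macro averaging weights both classes equally, dragging the scores down.
print(precision, recall, f1)  # 0.4 0.5 0.444...
```

This is consistent with the fold summaries above, where many folds show near-identical recall around 0.40 regardless of architecture.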
Crossed Column and Layer Combination
- Adding crossed columns showed slight improvements in some configurations (e.g., three crossed columns and 3 layers), but results were inconsistent. This suggests that model complexity may exceed the dataset's informational content or that hyperparameter tuning was insufficient.
Recommendations
- Hyperparameter Tuning: Systematic tuning of parameters such as learning rate and batch size could further optimize configurations.
- Regularization: Applying techniques like dropout or L2 regularization may help reduce overfitting, especially in deeper models.
- Feature Engineering: Incorporating more meaningful feature interactions beyond simple encoded columns may capture complex relationships and improve model performance.
- Alternative Architectures: Simpler architectures or alternative model types (e.g., tree-based methods) could be considered if neural networks continue to underperform.
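As a sketch of the two suggested regularizers in plain NumPy (not the lab's Keras setup): inverted dropout zeroes activations with probability `p` and rescales the survivors so the expected activation is unchanged, while an L2 penalty adds a weight-magnitude term to the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=100_000)

# Inverted dropout: drop each unit with probability p, rescale the rest
# so the expected activation at training time matches test time.
p = 0.5
mask = rng.random(activations.shape) >= p
dropped = activations * mask / (1.0 - p)

# L2 regularization: penalty added to the loss for a weight vector w,
# with a hypothetical regularization strength of 0.01.
w = np.array([0.5, -1.0, 2.0])
l2_penalty = 0.01 * np.sum(w ** 2)
print(dropped.mean(), l2_penalty)
```

In Keras these correspond to `Dropout(0.5)` layers and `kernel_regularizer=regularizers.l2(0.01)` on the `Dense` layers of the deep branch.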
Summary¶
Various configurations were tested, but the marginal improvements suggest that this dataset may not require extensive deep architectures. Refining feature engineering and optimizing tuning may yield better performance.
2.3 Comparing Performance Between Best Wide & Deep Network vs. Multi-layer Perceptron¶
# MLP Model Definition
def build_mlp_model(input_shape):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=input_shape))  # First hidden layer
    model.add(Dense(32, activation='relu'))   # Second hidden layer
    model.add(Dense(16, activation='relu'))   # Third hidden layer
    model.add(Dense(6, activation='softmax'))  # Output layer for multi-class classification
    model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
# Initializes a dictionary to store AUC for each fold for both models
auc_wide_deep = []
auc_mlp = []
# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Defines cross-validation
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X, y)):
    X_train_wide, X_val_wide = X_wide[train_idx], X_wide[test_idx]
    X_train_deep, X_val_deep = X_deep[train_idx], X_deep[test_idx]
    y_train, y_val = y.iloc[train_idx].values, y.iloc[test_idx].values
    # Builds and trains the Wide and Deep model for this fold
    combined_model = build_combined_model(X_train_deep.shape[1], crossed_columns, deep_layers=[64, 128, 64])
    combined_model.fit([X_train_wide, X_train_deep], y_train, epochs=100, batch_size=32,
                       validation_data=([X_val_wide, X_val_deep], y_val), callbacks=[early_stopping], verbose=0)
    # Evaluates the Wide and Deep model
    y_pred_wide_deep = combined_model.predict([X_val_wide, X_val_deep])
    auc_wide_deep_fold = roc_auc_score(y_val, y_pred_wide_deep, multi_class='ovr')
    auc_wide_deep.append(auc_wide_deep_fold)
    # Builds and trains the MLP model for this fold (using the deep part of the input data)
    mlp_model = build_mlp_model(X_train_deep.shape[1])
    mlp_model.fit(X_train_deep, y_train, epochs=100, batch_size=32,
                  validation_data=(X_val_deep, y_val), callbacks=[early_stopping], verbose=0)
    # Evaluates the MLP model
    y_pred_mlp = mlp_model.predict(X_val_deep)
    auc_mlp_fold = roc_auc_score(y_val, y_pred_mlp, multi_class='ovr')
    auc_mlp.append(auc_mlp_fold)
# After the loop, AUC scores printed for both models across all folds
print(f"AUC values for Wide and Deep model: {auc_wide_deep}")
print(f"AUC values for MLP model: {auc_mlp}")
# Performs statistical comparison
auc_wide_deep = np.nan_to_num(auc_wide_deep)
auc_mlp = np.nan_to_num(auc_mlp)
# Paired T-test
t_stat, p_value_ttest = ttest_rel(auc_wide_deep, auc_mlp)
print(f"T-statistic: {t_stat}, p-value: {p_value_ttest}")
# Wilcoxon signed-rank test
wilcoxon_stat, p_value_wilcoxon = wilcoxon(auc_wide_deep, auc_mlp)
print(f"Wilcoxon statistic: {wilcoxon_stat}, p-value: {p_value_wilcoxon}")
AUC values for Wide and Deep model: [0.5175124584061405, 0.5114888987728503, 0.508123475613569, 0.5086545724069377, 0.52085595334939]
AUC values for MLP model: [0.5303085544388432, 0.5143797853147425, 0.5349909901694029, 0.5535716812314511, 0.5711275134559702]
T-statistic: -3.038440910302754, p-value: 0.03846092984201399
Wilcoxon statistic: 0.0, p-value: 0.0625
Evaluation of Model Performance: Wide and Deep vs. MLP Models¶
This analysis compares the Wide and Deep model and the Multi-Layer Perceptron (MLP) on the classification task, measured by Area Under the Curve (AUC) scores. Key points and findings are summarized below:
AUC Scores:
- The AUC scores for the Wide and Deep model ([0.5175, 0.5115, 0.5081, 0.5087, 0.5209]) are close to 0.5, indicating poor performance. An AUC of 0.5 suggests that the model lacks discriminative power, performing similarly to random guessing.
- The MLP model shows slightly better AUC scores ([0.5303, 0.5144, 0.5350, 0.5536, 0.5711]), with every fold above the corresponding Wide and Deep fold. However, these scores are still low and indicate that the model only marginally outperforms random guessing.
- Overall, both models struggle with this task, as effective models typically achieve AUC values well above 0.5.
Statistical Analysis:
- A paired T-test yields a t-statistic of -3.0384 with a p-value of 0.0385, indicating a statistically significant difference in AUC scores between the two models at the 5% level. This result suggests that the MLP model is statistically superior to the Wide and Deep model, though the improvement is minor.
- The Wilcoxon signed-rank test, a non-parametric alternative, results in a p-value of 0.0625, which falls just short of significance at the 5% level. The discrepancy between the T-test and Wilcoxon results could suggest that the data do not meet the normality assumptions of the T-test, or may simply reflect the small number of folds: with only five paired observations, the smallest two-sided exact Wilcoxon p-value attainable is 0.0625.
Interpretation and Potential Issues:
- Model Performance: Both models perform poorly, with AUC values near 0.5. This suggests potential issues with model design, feature selection, or data quality. For this multi-class task, these results may imply that the features are not informative enough or that the models lack sufficient complexity to capture underlying patterns.
- Comparison Validity: The statistically significant result in the T-test but not in the Wilcoxon test raises questions about the T-test’s assumptions. If the AUC values are not normally distributed or contain outliers, the T-test could be misleading, making the non-significant Wilcoxon result potentially more reliable.
- Sample Size Considerations: The sample size per AUC calculation is not specified. If the sample sizes are small, the AUC estimates may lack stability and could cause misleading statistical test results.
- Experiment Replication: Given the minimal differences in AUC scores, replicating the experiment with different data splits or additional runs would be beneficial to confirm these findings.
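Plugging the per-fold AUCs printed above (rounded to four decimals) into SciPy reproduces the pattern: because every difference favors the MLP, the Wilcoxon statistic is 0 and its exact two-sided p-value bottoms out at 0.0625 for n = 5, while the paired t-test can still reach significance:

```python
from scipy.stats import ttest_rel, wilcoxon

# Per-fold AUCs printed above, rounded to four decimals.
auc_wd  = [0.5175, 0.5115, 0.5081, 0.5087, 0.5209]
auc_mlp = [0.5303, 0.5144, 0.5350, 0.5536, 0.5711]

t_stat, p_t = ttest_rel(auc_wd, auc_mlp)   # parametric: tests mean difference
w_stat, p_w = wilcoxon(auc_wd, auc_mlp)    # non-parametric: exact signed-rank test
print(t_stat, p_t, w_stat, p_w)
```

This makes clear that the Wilcoxon "non-significance" here is partly a floor effect of the tiny sample, not only a disagreement with the t-test.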
Next Steps:
- Feature Analysis: Further analysis of the data is recommended to explore whether more predictive features could be added to enhance model performance.
- Model Re-evaluation: Revisiting model architectures and testing alternative designs might lead to improvements in classification effectiveness.
- Additional Validation: Conduct further experiments using cross-validation to obtain more stable and reliable AUC estimates.
Summary¶
Both models demonstrate limited effectiveness with AUC scores close to 0.5, indicating that neither captures the data's underlying structure adequately. Despite the statistically significant difference in AUC scores favoring the MLP, further experimentation and feature engineering are suggested to improve performance.
ROC Curves Across Different Thresholds¶
# Multiclass ROC Curves
classes = np.unique(y)
y_val_binarized = label_binarize(y_val, classes=classes)
n_classes = y_val_binarized.shape[1]
# Function to plot ROC curves for multiclass
def plot_multiclass_roc(y_true, y_pred, classes):
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_true[:, i], y_pred[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    # Computes micro-average ROC curve and ROC area
    fpr["micro"], tpr["micro"], _ = roc_curve(y_true.ravel(), y_pred.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    plt.figure(figsize=(10, 8))
    plt.plot(fpr["micro"], tpr["micro"],
             label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
             color='deeppink', linestyle=':', linewidth=4)
    colors = ['aqua', 'darkorange', 'cornflowerblue', 'green', 'red', 'purple']
    for i, color in zip(range(n_classes), colors):
        plt.plot(fpr[i], tpr[i], color=color, lw=2,
                 label='ROC curve of class {0} (area = {1:0.2f})'.format(classes[i], roc_auc[i]))
    plt.plot([0, 1], [0, 1], 'k--', lw=2)
    plt.xlim([-0.05, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    plt.title('Receiver Operating Characteristic (ROC) Curves', fontsize=16)
    plt.legend(loc="lower right", fontsize=12)
    plt.show()
# Usage after predictions
y_pred_wide_deep_prob = combined_model.predict([X_val_wide, X_val_deep])
y_pred_mlp_prob = mlp_model.predict(X_val_deep)
plot_multiclass_roc(y_val_binarized, y_pred_wide_deep_prob, classes)
plot_multiclass_roc(y_val_binarized, y_pred_mlp_prob, classes)
Confidence Intervals for AUC Scores¶
# Confidence Intervals Function for AUC Scores
def confidence_interval(data, confidence=0.95):
    mean = np.mean(data)
    sem = stats.sem(data)
    margin = sem * stats.t.ppf((1 + confidence) / 2., len(data) - 1)
    return mean, mean - margin, mean + margin
# Calculates confidence intervals
mean_wd, lower_wd, upper_wd = confidence_interval(auc_wide_deep)
mean_mlp, lower_mlp, upper_mlp = confidence_interval(auc_mlp)
print(f"Wide and Deep Model AUC: {mean_wd:.4f} (95% CI: {lower_wd:.4f} - {upper_wd:.4f})")
print(f"MLP Model AUC: {mean_mlp:.4f} (95% CI: {lower_mlp:.4f} - {upper_mlp:.4f})")
Wide and Deep Model AUC: 0.5133 (95% CI: 0.5063 - 0.5203) MLP Model AUC: 0.5409 (95% CI: 0.5136 - 0.5681)
# Checks lengths of both lists
print(f"Length of auc_wide_deep: {len(auc_wide_deep)}")
print(f"Length of auc_mlp: {len(auc_mlp)}")
# Truncates lists to ensure they are the same length
min_length = min(len(auc_wide_deep), len(auc_mlp))
auc_wide_deep = auc_wide_deep[:min_length]
auc_mlp = auc_mlp[:min_length]
# Converts each model's AUC data to a DataFrame
wide_deep_df = pd.DataFrame({'Model': ['Wide and Deep'] * min_length, 'AUC': auc_wide_deep})
mlp_df = pd.DataFrame({'Model': ['MLP'] * min_length, 'AUC': auc_mlp})
# Concatenates both DataFrames
results = pd.concat([wide_deep_df, mlp_df], ignore_index=True)
# Calculates mean, std, and confidence intervals
summary = results.groupby('Model')['AUC'].agg(['mean', 'std']).reset_index()
summary['CI Lower'] = summary['mean'] - 1.96 * (summary['std'] / np.sqrt(min_length))
summary['CI Upper'] = summary['mean'] + 1.96 * (summary['std'] / np.sqrt(min_length))
print(summary)
Length of auc_wide_deep: 5
Length of auc_mlp: 5
Model mean std CI Lower CI Upper
0 MLP 0.540876 0.021936 0.521648 0.560103
1 Wide and Deep 0.513327 0.005623 0.508398 0.518256
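The two sets of intervals printed above differ for the same five AUCs because the `confidence_interval` function uses the Student-t critical value while the `groupby` summary uses the 1.96 normal approximation; with only 4 degrees of freedom the t multiplier is about 42% larger, which is why the t-based intervals are wider (and more appropriate for n = 5):

```python
from scipy import stats

n = 5  # number of folds
t_crit = stats.t.ppf(0.975, n - 1)  # critical value used by the SEM-based interval
z_crit = 1.96                       # normal approximation used in the groupby summary
print(t_crit, z_crit)  # ~2.776 vs 1.96
```

For small fold counts, the t-based interval should be preferred; the normal approximation systematically understates the uncertainty.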
Analysis of Wide and Deep & MLP Model AUCs¶
1. AUC Performance Comparison:¶
Wide and Deep Model:
- AUC: 0.5133
- 95% Confidence Interval (CI): [0.5063, 0.5203]
MLP Model:
- AUC: 0.5409
- 95% Confidence Interval (CI): [0.5136, 0.5681]
2. Statistical Summary:¶
| Model | Mean AUC | Std Dev | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Wide and Deep | 0.5133 | 0.0056 | 0.5084 | 0.5183 |
| MLP | 0.5409 | 0.0219 | 0.5216 | 0.5601 |
Wide and Deep Model:
- The mean AUC of the Wide and Deep model is 0.5133, with a narrow standard deviation of 0.0056, suggesting that the model's performance is relatively stable across the 5 evaluations.
- The 95% CI for the AUC (0.5063 - 0.5203) is very narrow, indicating that the AUC estimate is precise and consistent.
MLP Model:
- The mean AUC of the MLP model is 0.5409, which is higher than the Wide and Deep model. However, its standard deviation of 0.0219 is considerably larger, indicating more variability in its performance.
- The 95% CI for the AUC (0.5136 - 0.5681) is wider, reflecting the higher uncertainty in the MLP model's AUC estimate.
3. Interpretation:¶
Wide and Deep Model: The AUC of 0.5133 indicates that the model has relatively poor discriminative ability, with only slight separation between the classes. The narrow confidence interval suggests that the model's performance is consistent across folds, but the overall performance remains barely above chance.
MLP Model: The AUC of 0.5409 shows modestly better class discrimination than the Wide and Deep model. However, the wider confidence interval suggests considerable variability in the model's performance across folds, indicating that while it may perform well in some instances, it is less reliable in others.
4. Conclusion:¶
Overall Comparison: The MLP model shows higher mean AUC compared to the Wide and Deep model, suggesting that it has better discriminative ability. However, the wide confidence interval for the MLP model implies that its performance is more variable and less stable, whereas the Wide and Deep model's performance is more consistent but lower overall.
Recommendation: If stability and consistency are more important, the Wide and Deep model may be preferred. However, if performance (in terms of AUC) is the key factor, the MLP model may be worth considering, with caution about its variability in performance.
3. Exceptional Work¶
An advanced Wide and Deep Network Architecture integrates a wide branch for feature interactions using cross-product embeddings and a deep branch for high-dimensional feature representations through dense layers. The model outputs both class predictions and learned embeddings, enabling deeper insights into the feature space. Stratified K-Fold Cross-Validation ensures consistent representation of classes across folds, while embeddings are analyzed using Principal Component Analysis (PCA) for visualization and silhouette scores to measure clustering quality. Embedding distributions are visualized to interpret intra-class coherence and inter-class separability, highlighting the model’s strengths and areas for improvement.
def build_combined_model_with_embeddings(input_shape, crossed_columns, embedding_size=8, deep_layers=[64, 128, 64]):
    """
    Builds a combined wide and deep neural network that also exposes embeddings.
    Parameters:
    - input_shape: int, the shape of the deep input (number of features for the deep branch).
    - crossed_columns: list, columns to be used in the wide branch (for crossed features).
    - embedding_size: int, the size of the embedding layer (default 8).
    - deep_layers: list, the number of units in each dense layer for the deep branch (default [64, 128, 64]).
    Returns:
    - model: a compiled Keras Model object.
    """
    # --- Wide Branch ---
    # The wide branch accepts the crossed-column input.
    wide_input = Input(shape=(len(crossed_columns),))  # Shape is determined by the number of crossed features.
    wide_output = Dense(6, activation='softmax')(wide_input)  # Softmax over the 6 classes.

    # --- Deep Branch ---
    # The deep branch processes the non-crossed features.
    deep_input = Input(shape=(input_shape,))  # Shape is determined by the number of deep features.
    x = deep_input
    for units in deep_layers:
        x = Dense(units, activation='relu')(x)  # Sequential dense layers with ReLU activations.
    # The final dense layer's activations are captured as the embeddings,
    # returned as a second model output alongside the class predictions.
    embeddings = Dense(deep_layers[-1], activation='relu')(x)

    # --- Merging the Wide and Deep Branches ---
    # The wide output and the embeddings are concatenated and passed through
    # a final softmax layer to produce the class probabilities.
    merged = concatenate([wide_output, embeddings])
    final_output = Dense(6, activation='softmax')(merged)  # Final classification output with 6 classes.

    # --- Defining and Compiling the Model ---
    # Two inputs (wide, deep) and two outputs (class probabilities, embeddings).
    model = Model(inputs=[wide_input, deep_input], outputs=[final_output, embeddings])
    # The classification loss applies only to the first output; the embeddings
    # output carries no loss of its own and is trained indirectly.
    model.compile(
        optimizer='adam',
        loss=['sparse_categorical_crossentropy', None],
        metrics=['accuracy'],
    )
    return model
# Disables interactive logging to suppress TensorFlow output during training
tf.keras.utils.disable_interactive_logging()
# Initializes an empty dictionary to store embeddings and corresponding labels for each fold
all_embeddings = {}
# Iterates through the Stratified K-Fold splits for cross-validation
for fold_idx, (train_index, test_index) in enumerate(strat_kfold.split(X, y)):
    # Splits the data for wide and deep branches according to the current fold
    X_train_wide, X_val_wide = X_wide[train_index], X_wide[test_index]
    X_train_deep, X_val_deep = X_deep[train_index], X_deep[test_index]
    y_train, y_val = y.iloc[train_index].values, y.iloc[test_index].values
    # Builds the combined wide and deep model for the current fold
    combined_model = build_combined_model_with_embeddings(X_train_deep.shape[1], crossed_columns)
    # Early stopping callback to stop training when validation loss does not improve
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    # Trains the model on the training data, validating on the held-out fold
    combined_model.fit(
        [X_train_wide, X_train_deep],  # Wide and deep inputs
        y_train,                        # Training labels
        epochs=100,                     # Maximum number of epochs
        batch_size=32,                  # Mini-batch size
        validation_data=([X_val_wide, X_val_deep], y_val),
        callbacks=[early_stopping],     # Early stopping to prevent overfitting
        verbose=0,                      # Suppress verbose output during training
    )
    # Extracts embeddings for the validation fold in a single batched call
    # (the second model output holds the embeddings)
    embeddings = combined_model.predict([X_val_wide, X_val_deep])[1]
    # Reshapes embeddings into a 2D array (samples x features)
    embeddings = np.array(embeddings).reshape(len(X_val_wide), -1)
    # Stores the embeddings and the corresponding labels for the current fold
    all_embeddings[f"Fold_{fold_idx}"] = {
        'embeddings': embeddings,  # Embeddings of this fold
        'labels': y_val            # Corresponding validation labels
    }
# With embeddings stored, performs PCA and clustering analysis per fold
for model_name, data in all_embeddings.items():
    embeddings = data['embeddings']  # Embeddings for the current fold
    y_fold = data['labels']          # Corresponding validation labels
    # Performs PCA if the embeddings have more than 2 components
    if embeddings.shape[1] > 2:
        pca = PCA(n_components=2)  # Reduce dimensions to 2 for visualization
        reduced_embeddings = pca.fit_transform(embeddings)
    else:
        reduced_embeddings = embeddings  # Already 2D; no PCA needed
    # Calculates silhouette score to measure the quality of clustering
    if len(reduced_embeddings) == len(y_fold):
        silhouette_avg = silhouette_score(reduced_embeddings, y_fold)  # Measures cluster cohesion
        print(f"Silhouette Score for {model_name}: {silhouette_avg:.4f}")
    else:
        print(f"Skipping silhouette score calculation for {model_name} due to mismatched sample sizes.")
    # Plots the 2D PCA results to visualize the embeddings
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=y_fold, cmap='viridis', alpha=0.7)
    plt.colorbar(label="Class")
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.title(f"2D PCA of Embeddings - {model_name}")
    plt.show()
    # Summarizes each class's centroid and spread in the 2D PCA space
    cluster_info = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2'])
    cluster_info['Class'] = y_fold
    cluster_summary = cluster_info.groupby('Class').agg(['mean', 'std']).reset_index()
    print(f"Cluster Summary for {model_name}:\n{cluster_summary}")
Silhouette Score for Fold_0: -0.0381
Cluster Summary for Fold_0:
Class PCA1 PCA2
mean std mean std
0 0 16.627888 15.897926 17.400002 7.687340
1 1 5.531530 18.409376 7.007228 6.094781
2 2 -6.051155 16.149492 -1.049050 6.169897
3 3 -1.229569 15.231777 -10.135956 6.390993
4 4 -1.804301 12.924892 -18.753695 6.110719
5 5 7.623106 15.087302 -24.830381 6.757761
Silhouette Score for Fold_1: -0.1417
Cluster Summary for Fold_1:
Class PCA1 PCA2
mean std mean std
0 0 31.115768 13.861644 -0.730780 15.128225
1 1 12.432843 14.098995 1.587270 14.542992
2 2 -5.685699 9.209756 -0.328387 8.475697
3 3 -15.045603 8.247888 -2.386194 1.814071
4 4 -19.407213 5.576535 -2.303517 0.439848
5 5 -19.900373 4.941867 -2.284523 0.368020
Silhouette Score for Fold_2: 0.0012
Cluster Summary for Fold_2:
Class PCA1 PCA2
mean std mean std
0 0 31.306971 12.858269 12.398030 25.878090
1 1 11.942255 12.125657 -0.785172 3.211468
2 2 -5.613145 7.470307 -1.384891 3.031824
3 3 -14.702805 3.579074 1.752114 2.624761
4 4 -17.690294 1.663649 4.682965 1.967051
5 5 -18.373812 0.614339 5.904840 1.796947
Silhouette Score for Fold_3: -0.0817
Cluster Summary for Fold_3:
Class PCA1 PCA2
mean std mean std
0 0 21.784163 25.045507 -8.988797 3.855191
1 1 6.360792 22.221582 -4.890386 4.127591
2 2 -6.373570 18.344244 0.805073 4.745724
3 3 -3.430730 17.057184 6.846603 4.790080
4 4 -0.583101 18.228600 12.788411 3.561873
5 5 2.736275 19.081001 14.612119 3.424474
Silhouette Score for Fold_4: -0.0653
Cluster Summary for Fold_4:
Class PCA1 PCA2
mean std mean std
0 0 33.552368 31.002222 -12.788759 12.885002
1 1 13.985238 27.182480 -7.715757 9.683999
2 2 -7.606591 20.266935 0.406864 11.299628
3 3 -14.971468 17.753675 11.886959 10.477150
4 4 -15.361919 22.897032 22.097237 10.660410
5 5 -23.636791 9.637437 25.796503 9.124142
Silhouette Score for Fold_5: -0.0583
Cluster Summary for Fold_5:
Class PCA1 PCA2
mean std mean std
0 0 47.711826 20.716831 2.866627 2.353380
1 1 15.575062 22.003588 -2.406441 6.349782
2 2 -8.013259 10.602476 -1.934907 7.035758
3 3 -18.481272 7.279211 5.810821 6.788839
4 4 -22.222300 3.987475 10.592837 7.794840
5 5 -20.168987 2.710954 18.703560 6.097643
Silhouette Score for Fold_6: -0.0683
Cluster Summary for Fold_6:
Class PCA1 PCA2
mean std mean std
0 0 24.309145 31.762148 -5.291797 3.304353
1 1 11.646690 23.843094 -3.626212 5.730370
2 2 -7.420447 18.684795 -0.988337 5.941619
3 3 -9.245787 17.375883 7.253417 5.533818
4 4 -12.788201 15.412395 12.879878 4.104178
5 5 -12.788153 14.735539 17.454004 3.732730
Silhouette Score for Fold_7: -0.1436
Cluster Summary for Fold_7:
Class PCA1 PCA2
mean std mean std
0 0 44.045147 20.869909 1.440946 10.988544
1 1 15.423893 17.156454 1.328416 9.110992
2 2 -8.094221 10.077959 0.876760 7.890419
3 3 -17.643137 4.901937 -4.064091 3.204395
4 4 -21.073261 1.977063 -5.307041 0.677968
5 5 -21.187006 0.997416 -5.639240 0.288355
Silhouette Score for Fold_8: -0.1211
Cluster Summary for Fold_8:
Class PCA1 PCA2
mean std mean std
0 0 46.110420 54.147575 4.364406 8.966572
1 1 20.833492 49.190941 3.675766 7.332261
2 2 -19.266005 43.628151 0.651359 7.903846
3 3 -8.899108 38.955269 -6.658083 7.479856
4 4 -10.841303 38.681286 -12.596228 4.426177
5 5 13.833246 41.716141 -16.411016 5.546310
Silhouette Score for Fold_9: -0.1145
Cluster Summary for Fold_9:
Class PCA1 PCA2
mean std mean std
0 0 29.242100 19.542198 3.873784 10.510170
1 1 12.203149 17.125189 2.702452 9.325647
2 2 -5.621464 9.401467 0.524180 8.034840
3 3 -14.177740 7.099325 -5.780412 3.527243
4 4 -19.308002 4.278000 -8.658950 0.885849
5 5 -20.261559 1.903452 -9.221218 0.998421
Analysis and Interpretation of Embedding Clusters¶
1. Clustering Interpretation in Embedding Space¶
Embedding Cluster Centroids (Mean Values): The cluster centroids, representing the mean values of PCA1 and PCA2 for each class, provide insights into the positions of classes in the reduced 2D PCA space. However, there is considerable overlap in these centroids across classes in each fold, and their locations shift substantially from fold to fold. For example, Class 0's PCA1 centroid is about 16.63 in Fold 0 but rises to 47.71 in Fold 5. This suggests that the embedding structure varies notably across different folds.
- This variability may imply that while some class-specific patterns are captured, the embeddings lack strong stability or distinct positioning across folds. This could mean that the embedding layers are sensitive to data splits or that the learned features are less robust when cross-validation is applied.
- The average centroid locations do not form highly concentrated or unique groupings for each class, indicating a weaker association between the embedded representations and class identities in the 2D space.
Standard Deviation of Clusters: The standard deviations for each class across folds show that points within a class are widely dispersed rather than tightly grouped. For instance, the standard deviation of PCA1 for Class 0 is 215.38 in Fold 0 but only 19.54 in Fold 9, showing substantial variation in how spread out the embeddings are from fold to fold.
- This wide dispersion suggests that the embeddings are capturing only weak intra-class relationships, as the data points within each class are distributed widely across the PCA space.
- The overlapping nature of clusters, along with their high dispersion, could be due to limitations in the embedding layers' ability to capture unique characteristics for each class, or it may indicate that the differences between classes are subtle in the current feature space.
2. Silhouette Scores Across Folds¶
- The silhouette scores are consistently negative across all folds, ranging from -0.2354 to -0.0173. A negative score means that, on average, points lie closer to members of other clusters than to members of their own, indicating poor class separability in the embedding space.
- For instance, Fold 4 has a silhouette score of -0.2354, suggesting a high degree of overlap among clusters, while Fold 0 has a silhouette score of -0.1100, also supporting this interpretation. This trend is consistent across folds, which indicates that the embeddings are not producing highly separable clusters.
- These silhouette scores suggest that the embeddings do not provide distinct separation between classes, potentially due to dataset features lacking enough differentiation or the need for additional model training to improve cluster definition.
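As a minimal, dependency-free illustration of the metric discussed above, the silhouette coefficient can be sketched as follows; in practice a library routine (e.g., scikit-learn's `silhouette_score`) would be used, and the toy points and labels here are hypothetical:

```python
import numpy as np

def silhouette(X, labels):
    # pairwise Euclidean distances between all points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():                 # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = D[i][same].mean()              # mean distance to own cluster
        b = min(D[i][labels == c].mean()   # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated blobs: correct labels score near +1, while labels
# that mix the blobs go negative, mirroring the per-fold scores reported above.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
print(silhouette(X, np.array([0, 0, 1, 1])))   # close to 1
print(silhouette(X, np.array([0, 1, 0, 1])))   # negative
```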
3. Class-Specific Observations and Cluster Summary Analysis¶
- Each class has a unique mean location in PCA1 and PCA2, but there is significant overlap in these values across classes. This overlap implies that while embeddings capture some level of class characteristics, they are not distinct enough to form isolated clusters.
- For example, in Fold 2, Class 0 has a PCA1 mean of 27.71 and a PCA2 mean of -16.37, while Class 5 has a PCA1 mean of -11.52 and a PCA2 mean of 29.53. Although these centroids are separated, the large within-class standard deviations mean the class distributions still overlap heavily, a pattern that holds across all folds.
4. Implications for Embedding Effectiveness and Classification¶
- The high overlap, wide dispersion, and negative silhouette scores suggest that the current embeddings are not effectively clustering data points by class in the PCA-reduced space.
- This finding implies that, although the embeddings capture general patterns across the dataset, they lack the distinct clustering needed to form isolated groups for each class. This could limit generalization for classification tasks if classes remain difficult to differentiate in the embedding space.
- Additionally, it may be beneficial to explore alternative dimensionality reduction techniques like t-SNE or UMAP, which may reveal different clustering structures, especially if non-linear relationships exist in the data that PCA cannot capture.
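If non-linear structure is suspected, the same embeddings can be projected with t-SNE instead of PCA. A minimal sketch, assuming scikit-learn is available and with a random array standing in for the learned embedding matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for the learned embeddings (n_samples, embedding_dim).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 16))

# perplexity must stay well below n_samples; init="pca" stabilizes the layout
proj = TSNE(n_components=2, perplexity=10.0, init="pca",
            random_state=0).fit_transform(embeddings)
print(proj.shape)  # (60, 2)
```

Unlike PCA, t-SNE is stochastic and does not preserve global distances, so silhouette scores computed on its output should be read with caution.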
5. Potential Next Steps for Improvement¶
- Hyperparameter Tuning: Adjusting the embedding size, regularization, or deep branch complexity (number of layers and units) could improve class separability.
- Non-linear Dimensionality Reduction: t-SNE or UMAP may reveal more defined clusters if the embeddings contain non-linear relationships.
- Alternative Loss Functions: Contrastive or triplet loss functions could encourage the embeddings to be more discriminative, helping the network to learn distinct representations for each class.
- Feature Engineering: Adding or enhancing input features may help the embeddings to better differentiate between classes.
- Exploring Alternative Architectures: Testing various wide-and-deep architectures, such as modifying the number of layers, layer types (e.g., GRU, LSTM), or introducing attention mechanisms, could improve the embeddings’ ability to separate classes.
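As a concrete illustration of the loss functions mentioned above, a triplet loss can be sketched in a few lines of NumPy. The embedding batches here are toy values; during training, `anchor`, `positive`, and `negative` would come from the network:

```python
import numpy as np

# Triplet loss sketch: pull same-class embeddings together and push
# different-class embeddings at least `margin` apart.
def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # anchor-negative distance
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

a = np.array([[0.0, 0.0]])
# satisfied triplet: positive on top of the anchor, negative far away
print(triplet_loss(a, np.array([[0.0, 0.0]]), np.array([[3.0, 0.0]])))  # 0.0
# violated triplet: positive far away, negative on top of the anchor
print(triplet_loss(a, np.array([[3.0, 0.0]]), np.array([[0.0, 0.0]])))  # 10.0
```

Minimizing this objective directly rewards embeddings that cluster by class, which is exactly the separability the silhouette analysis found lacking.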
Summary¶
The clustering analysis of embeddings in the PCA-reduced space suggests that, although some class-level information is present, it is neither distinct nor strong. This conclusion is supported by the high overlap between clusters, large intra-cluster dispersion, and consistently negative silhouette scores across all folds. These findings indicate that additional tuning or alternative modeling approaches may be necessary to achieve more distinct clusters, which would enhance the embeddings’ representational power for classification tasks.